Predicting Attributes/Demographics
April 7th, 2018
When attempting to identify specific attributes or demographics, we often resort to machine learning techniques to predict the value. However, basic probability applied to aggregate statistics is often enough to produce a reasonable estimate.
Overview
It’s common practice for websites, web analytics tools (like Google Analytics Demographics & Interests), and recommendation systems to collect information from users to build demographic profiles for analysis, content recommendation, and audience targeting. However, this raises several problems for different types of businesses:
- These services do not indicate what data was used to make the prediction, nor provide a measure of uncertainty for the predictions.
- Most websites do not have enough visitors to get a decent sample of their audience, which makes analysis difficult and audience segmentation nearly impossible.
- For websites attempting to attract new visitors, learning about those visitors can be problematic. When a visitor first arrives, the site knows little to nothing about them and only learns more over the course of the visit. How can the site serve content the visitor will find most satisfying when it has little (or no) data about them?
- Large-scale websites can struggle to make full use of their pages and their data. New visitors can arrive at any of many landing pages, each of which can provide different types of data about the user. Because of this wide variety of entry points, a site cannot predict which data points it will get about a user. How can it progressively enhance its recommendations and user profiles in real time when it does not know where the user will land, or the order in which it will receive information?
These problems can be mitigated by combining user data with relevant statistics drawn from larger samples published by reputable sources. Statistics based on a larger sample can provide a more accurate picture of segments in the wider population. While this approach also has flaws, it can provide a starting point for analysis and content recommendation.
I will start with how to get probabilities from summary statistics of external data. I will then cover how to use this output for probability-based fuzzy segmentation and to dynamically predict attributes (in this case, demographics). These examples will all be based on athletics demographics and segmentation.
Probability from External Data
The data sets we will be using are sampled from the Running USA National Runner Survey and the Datalys publications, specifically those from the Journal of Athletic Training publication Epidemiology of National Collegiate Athletic Association Men’s and Women’s Cross-Country Injuries, 2009–2010 Through 2013–2014.
The data is divided into 2 files:
- A `total_statistics.csv` file, which contains the top-level statistics about each group
- A `runner_type_statistics.csv` file, which contains probabilities for the same fields, but broken down by runner types (which will be covered shortly)
The `total_statistics.csv` file contains the following data (truncated for brevity):
category | name | probability |
---|---|---|
Gender | Female | 0.73 |
Gender | Male | 0.27 |
The `runner_type_statistics.csv` file contains data in the following format (again, truncated for brevity):
category | runner type | name | probability |
---|---|---|---|
Health | serious & competitive | Excellent | 0.52 |
Health | frequent/fitness | Excellent | 0.2 |
Health | walker/jogger/recreational | Excellent | 0.2 |
Health | obstacle event participant | Excellent | 0.19 |
Please use these samples as a reference for the Python code samples later in this post.
We will be loading these files into Pandas DataFrames to work with the data more efficiently and to reduce the boilerplate code needed to demonstrate the concept:
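As a minimal sketch of the loading step, assuming both files are comma-separated and use the column names shown in the sample tables, the snippet below reads them with `pd.read_csv`. The inline strings are stand-ins that hold only the truncated rows shown above; with the real files, you would pass the file path instead.

```python
import io

import pandas as pd

# Inline stand-ins for the two CSV files (ASSUMPTION: comma-delimited,
# with the column names from the sample tables). Only the truncated
# sample rows are included here.
TOTAL_CSV = """category,name,probability
Gender,Female,0.73
Gender,Male,0.27
"""

RUNNER_TYPE_CSV = """category,runner type,name,probability
Health,serious & competitive,Excellent,0.52
Health,frequent/fitness,Excellent,0.2
Health,walker/jogger/recreational,Excellent,0.2
Health,obstacle event participant,Excellent,0.19
"""

# With the real files: pd.read_csv("total_statistics.csv"), and so on.
total_stats = pd.read_csv(io.StringIO(TOTAL_CSV))
runner_type_stats = pd.read_csv(io.StringIO(RUNNER_TYPE_CSV))

print(total_stats)
print(runner_type_stats)
```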
For each of the category-name pairings in `total_statistics.csv`, we will calculate the probability of being a particular runner type or a particular gender given that a person has that category-name pairing.
We already know the probability that a person fits into a particular demographic given that they are a particular gender or a certain type of runner. But what about the other way around? Given that a person fits into a particular demographic, what is the likelihood that the person is a particular gender or a certain type of runner?
The following script breaks down the data from both files using Bayes’ Theorem to calculate the probability of each gender or runner type given a demographic.
The printed statements are the output probabilities in the format P( <unknown variable/variable being predicted> | <known variable/demographic> )
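A minimal sketch of this kind of calculation, not the original script, might look like the following. It applies Bayes' theorem, P(type | demographic) = P(demographic | type) × P(type) / P(demographic), to the truncated Health rows shown earlier. The prior values in `priors` are hypothetical placeholders, and the function name `posterior_by_runner_type` is introduced only for illustration. The sketch also assumes the runner types are mutually exclusive and cover everyone, so that P(demographic) can be computed with the law of total probability.

```python
import pandas as pd

# P(demographic | runner type), copied from the truncated sample table.
likelihoods = pd.DataFrame(
    {
        "category": ["Health"] * 4,
        "runner type": [
            "serious & competitive",
            "frequent/fitness",
            "walker/jogger/recreational",
            "obstacle event participant",
        ],
        "name": ["Excellent"] * 4,
        "probability": [0.52, 0.2, 0.2, 0.19],
    }
)

# ASSUMPTION: hypothetical, equal priors P(runner type).
# Replace these with the real share of each runner type.
priors = pd.Series(
    {
        "serious & competitive": 0.25,
        "frequent/fitness": 0.25,
        "walker/jogger/recreational": 0.25,
        "obstacle event participant": 0.25,
    }
)


def posterior_by_runner_type(likelihoods, priors):
    """Add a 'posterior' column holding P(runner type | category, name)."""
    df = likelihoods.copy()
    df["prior"] = df["runner type"].map(priors)
    # Numerator of Bayes' theorem: P(demographic | type) * P(type)
    df["joint"] = df["probability"] * df["prior"]
    # Denominator: P(demographic), summed over all runner types
    evidence = df.groupby(["category", "name"])["joint"].transform("sum")
    df["posterior"] = df["joint"] / evidence
    return df


posteriors = posterior_by_runner_type(likelihoods, priors)
for _, row in posteriors.iterrows():
    print(
        f"P( {row['runner type']} | {row['category']}: {row['name']} )"
        f" = {row['posterior']:.3f}"
    )
```

With equal priors, the posterior for a serious & competitive runner given excellent health comes out near 0.47, simply because that group reports excellent health far more often than the others. Different priors will shift every posterior.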
In order to convert probabilities to classifications, we need to set a probability threshold for the category types. A lower threshold gives us more information, but that information may be less reliable; a higher threshold gives us less information, but it will be more reliable. You may have to experiment with the threshold based on your acceptable level of risk. If no category types are above this threshold, we can conclude that we do not yet have enough information:
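The thresholding step can be sketched roughly as follows. The function name `classify`, the default threshold of 0.4, and the example probabilities are all illustrative assumptions rather than values from the survey data.

```python
def classify(posteriors, threshold=0.4):
    """Return the labels whose probability meets the threshold.

    Labels are ordered from most to least likely. An empty list means
    that no label is probable enough yet, i.e. not enough information.
    """
    selected = {label: p for label, p in posteriors.items() if p >= threshold}
    return sorted(selected, key=selected.get, reverse=True)


# ASSUMPTION: illustrative probabilities, not taken from the survey data.
example = {
    "serious & competitive": 0.47,
    "frequent/fitness": 0.18,
    "walker/jogger/recreational": 0.18,
    "obstacle event participant": 0.17,
}

print(classify(example, threshold=0.4))   # stricter: fewer, more reliable labels
print(classify(example, threshold=0.15))  # looser: more, less reliable labels
print(classify(example, threshold=0.6))   # nothing qualifies: not enough information
```

Returning a list rather than a single label keeps the segmentation fuzzy: a visitor can belong to several segments at once when more than one clears the threshold.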
Estimating demographics using larger datasets
Small businesses and startups have to find their audiences in order to know who to market their products and services to. Many companies, like MileStat, Flash Results, Inc., and RunCoach, cater their services to niche audiences. But however small the niche, they still need to identify the audiences with the highest conversion rates and the ones that are not converting at all.
Finding the converting audiences is possibly the easier of the two problems, because many of these services request demographic information (gender, age, physical activity level, etc.) from users during setup. This information is generally more trustworthy and accurate, because the user expects it to be used to provide a more useful service. Finding the users who are not converting is the harder problem. These visitors may not explicitly provide information to the company, and if they do, they have no incentive to do so truthfully.
Business analysts have to manage this uncertainty in their predictions and classifications, and larger data sets help with that. These data sets should reflect the distribution of the total population fairly accurately, so (in the absence of significant customer data) they can act as a crude but effective way of predicting visitor demographics. The distribution does not need to be exact, so long as the correlations are close to those of the actual population. While this method has its flaws, it is meant to be a temporary complement to a growing, but currently small, data set.
Here is a relatively common example: a race results website is trying to identify who its most active audiences are, but it has no visitor logins and only recently deployed tracking tools like Google Tag Manager / Google Analytics. Many smaller content-driven websites are in the same boat: boatloads of content, but no way of finding out who is reading and who is bouncing. There are a few high-level demographics that will be useful for this website:
- Gender
- Role: Race Official, Coach, Athlete or Enthusiast
- Event Preferences: Distance running vs Sprints vs Field events
- Team Association (if any or if applicable)
We can use the Running USA 2012 Survey to enhance our predictions. But before we dive into solving our problem, let’s explore the data.
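One quick check worth making while exploring is whether the probabilities within a subcategory sum to roughly 1 (single-choice answers such as gender) or to well over 1 (multi-select answers such as goals), since the two kinds cannot be treated the same way in later calculations. The snippet below is a rough sketch, assuming the table that follows is stored as comma-separated text with the column names from its header. The inline sample holds only a handful of rows copied from it, and the 0.02 rounding tolerance is an arbitrary choice.

```python
import io

import pandas as pd

# A few rows copied from the survey table (ASSUMPTION: comma-delimited).
SAMPLE = """category,subcategory,dimension,dimension name,name,probability
demographics,gender,all,all,female,0.37
demographics,gender,all,all,male,0.63
motivators,goals,all,all,stay healthy,0.68
motivators,goals,all,all,set a new PR,0.43
motivators,goals,all,all,Run a new race,0.4
"""

survey = pd.read_csv(io.StringIO(SAMPLE))

# Total probability mass per (category, subcategory) pair.
totals = survey.groupby(["category", "subcategory"])["probability"].sum()

# Published percentages are rounded, so allow a small tolerance.
TOLERANCE = 0.02
single_choice = totals[(totals - 1.0).abs() <= TOLERANCE]
multi_choice = totals[totals > 1.0 + TOLERANCE]

print("Sums to about 1 (mutually exclusive answers):")
print(single_choice)
print("Sums to more than 1 (respondents could pick several answers):")
print(multi_choice)
```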
category | subcategory | dimension | dimension name | name | probability |
---|---|---|---|---|---|
demographics | gender | all | all | female | 0.37 |
demographics | gender | all | all | male | 0.63 |
demographics | age group | all | all | 18-24 | 0.04 |
demographics | age group | all | all | 25-34 | 0.19 |
demographics | age group | all | all | 35-44 | 0.28 |
demographics | age group | all | all | 45-54 | 0.27 |
demographics | age group | all | all | 55-64 | 0.17 |
demographics | age group | all | all | 65-74 | 0.05 |
demographics | age group | all | all | >75 | 0.01 |
demographics | marital status | all | all | married | 0.68 |
demographics | marital status | all | all | single, never married | 0.2 |
demographics | marital status | all | all | divorced | 0.08 |
demographics | marital status | all | all | domestic partner | 0.03 |
demographics | marital status | all | all | widowed | 0.01 |
demographics | marital status | all | all | separated | 0.01 |
demographics | household composition | all | all | 1 | 0.14 |
demographics | household composition | all | all | 2 | 0.32 |
demographics | household composition | all | all | 3 | 0.16 |
demographics | household composition | all | all | 4 | 0.23 |
demographics | household composition | all | all | >5 | 0.14 |
demographics | children under 19 | all | all | 0 | 0.52 |
demographics | children under 19 | all | all | 1 | 0.16 |
demographics | children under 19 | all | all | 2 | 0.22 |
demographics | children under 19 | all | all | 3 | 0.08 |
demographics | children under 19 | all | all | >4 | 0.03 |
demographics | education | all | all | Currently full-time student and under 19 | 0.04 |
demographics | education | all | all | Attended college 1-3 years | 0.08 |
demographics | education | all | all | Associate’s degree | 0.06 |
demographics | education | all | all | Technical or Trade degree | 0.02 |
demographics | education | all | all | Graduated from 4-year college | 0.36 |
demographics | education | all | all | Post-grad study without degree | 0.06 |
demographics | education | all | all | Master’s degree | 0.29 |
demographics | education | all | all | Doctoral degree | 0.08 |
demographics | employment | all | all | Employed full-time | 0.87 |
demographics | employment | all | all | student | 0.03 |
demographics | employment | all | all | retired | 0.03 |
demographics | employment | all | all | homemaker | 0.06 |
demographics | employment | all | all | unemployed | 0.02 |
demographics | ethnicity | all | all | white/caucasian | 0.84 |
demographics | ethnicity | all | all | black/african american | 0.07 |
demographics | ethnicity | all | all | asian/pacific islander | 0.06 |
demographics | ethnicity | all | all | hispanic | 0.04 |
demographics | ethnicity | all | all | american indian | 0.01 |
demographics | ethnicity | all | all | other | 0.02 |
demographics | annual income (employed) | all | all | < $15,000 | 0.03 |
demographics | annual income (employed) | all | all | $15,000-$24,999 | 0.03 |
demographics | annual income (employed) | all | all | $25,000-$34,999 | 0.04 |
demographics | annual income (employed) | all | all | $35,000-$49,999 | 0.12 |
demographics | annual income (employed) | all | all | $50,000-$74,999 | 0.24 |
demographics | annual income (employed) | all | all | $75,000-$99,999 | 0.15 |
demographics | annual income (employed) | all | all | $100,000-$124,999 | 0.1 |
demographics | annual income (employed) | all | all | $125,000-$149,999 | 0.04 |
demographics | annual income (employed) | all | all | $150,000-$174,999 | 0.03 |
demographics | annual income (employed) | all | all | $175,000-$199,999 | 0.01 |
demographics | annual income (employed) | all | all | >$200,000 | 0.04 |
demographics | annual income (employed) | all | all | Don’t Know/Refused | 0.17 |
demographics | runner type | all | all | Jogger / Recreational Runner | 0.21 |
demographics | runner type | all | all | Serious & Competitive Runner | 0.15 |
demographics | runner type | all | all | Frequent / Fitness Runner | 0.62 |
habits | time ran per year | all | all | 12 months / year | 0.76 |
habits | time ran per week | all | all | >= 4 days / week | 0.56 |
motivators | start running | all | all | for exercise | 0.24 |
motivators | start running | all | all | weight concerns | 0.14 |
motivators | start running | all | all | Friend / family encouragement | 0.09 |
motivators | start running | all | all | To enter a race | 0.08 |
motivators | start running | all | all | Competed in school and never stopped | 0.08 |
motivators | start running | all | all | Because I enjoy it | 0.06 |
motivators | start running | all | all | Needed a new challenge | 0.06 |
motivators | start running | all | all | To relieve stress | 0.06 |
motivators | start running | all | all | Health concerns besides weight | 0.05 |
motivators | start running | all | all | To get in shape for another sport | 0.03 |
motivators | continue to run | all | all | staying healthy | 0.77 |
motivators | continue to run | all | all | staying in shape | 0.73 |
motivators | continue to run | all | all | relieving stress | 0.62 |
motivators | continue to run | all | all | to enter / train for a race | 0.62 |
motivators | continue to run | all | all | having fun | 0.55 |
motivators | continue to run | all | all | achieving a goal | 0.54 |
motivators | continue to run | all | all | meeting a personal challenge | 0.53 |
motivators | continue to run | all | all | improving my state of mind | 0.53 |
motivators | continue to run | all | all | controlling my weight | 0.53 |
motivators | continue to run | all | all | improving speed or endurance | 0.46 |
motivators | continue to run | all | all | socializing with friends/family/others | 0.41 |
motivators | continue to run | all | all | appreciating nature, scenery | 0.39 |
motivators | continue to run | all | all | being by myself for awhile | 0.39 |
motivators | continue to run | all | all | getting into the natural environment | 0.31 |
motivators | continue to run | all | all | competing against others | 0.22 |
motivators | continue to run | all | all | stay injury free | 0.13 |
motivators | goals | all | all | stay healthy | 0.68 |
motivators | goals | all | all | set a new PR | 0.43 |
motivators | goals | all | all | Run a new race | 0.4 |
motivators | goals | all | all | lose weight | 0.32 |
motivators | goals | all | all | run a marathon | 0.29 |
motivators | goals | all | all | keep a running streak | 0.13 |
motivators | goals | all | all | Run 2017+ miles | 0.05 |
motivators | goals | all | all | Run 1+ mile/day | 0.03 |
activity | regular activities as part of running | all | all | warm up/stretch | 0.55 |
activity | regular activities as part of running | all | all | use machines/weight | 0.49 |
activity | regular activities as part of running | all | all | walk | 0.47 |
activity | regular activities as part of running | all | all | bike | 0.27 |
activity | regular activities as part of running | all | all | yoga/pilates | 0.26 |
activity | regular activities as part of running | all | all | hike | 0.21 |
activity | regular activities as part of running | all | all | HIIT Training | 0.21 |
activity | regular activities as part of running | all | all | Spin | 0.18 |
activity | regular activities as part of running | all | all | Swim | 0.14 |
activity | regular activities as part of running | all | all | Aerobics | 0.11 |
activity | regular activities as part of running | all | all | Crossfit | 0.1 |
activity | regular activities as part of running | all | all | Form Drills | 0.09 |
preferences | time to run | all | all | early AM | 0.64 |
preferences | time to run | all | all | Mid-morning | 0.25 |
preferences | time to run | all | all | Noon | 0.06 |
preferences | time to run | all | all | Early afternoon | 0.06 |
preferences | time to run | all | all | Mid-afternoon | 0.11 |
preferences | time to run | all | all | Early evening | 0.39 |
preferences | time to run | all | all | Late evening | 0.09 |
preferences | regular running workouts | all | all | easy runs (aerobic) | 0.86 |
preferences | regular running workouts | all | all | long runs (>1 hour) | 0.79 |
preferences | regular running workouts | all | all | hill training | 0.44 |
preferences | regular running workouts | all | all | pace workouts | 0.44 |
preferences | regular running workouts | all | all | tempo runs | 0.38 |
preferences | regular running workouts | all | all | recovery runs | 0.31 |
preferences | regular running workouts | all | all | fartlek | 0.17 |
preferences | regular running workouts | all | all | pickups | 0.08 |
preferences | social runs | all | all | alone | 0.55 |
preferences | social runs | all | all | with one other person | 0.16 |
preferences | social runs | all | all | in a group | 0.14 |
preferences | social runs | all | all | no preference | 0.16 |
preferences | venue/surface | all | all | paved path | 0.67 |
preferences | venue/surface | all | all | urban road | 0.54 |
preferences | venue/surface | all | all | park | 0.47 |
preferences | venue/surface | all | all | rural road | 0.44 |
preferences | venue/surface | all | all | dirt trail | 0.4 |
preferences | venue/surface | all | all | outdoor track | 0.2 |
preferences | venue/surface | all | all | mountains | 0.16 |
preferences | venue/surface | all | all | treadmill | 0.14 |
preferences | venue/surface | all | all | beach | 0.12 |
preferences | season | all | all | fall | 0.46 |
preferences | season | all | all | winter | 0.1 |
preferences | season | all | all | spring | 0.34 |
preferences | season | all | all | summer | 0.11 |
preferences | prevent running outside | all | all | too icy | 0.69 |
preferences | prevent running outside | all | all | don’t feel safe | 0.48 |
preferences | prevent running outside | all | all | windchill too low | 0.44 |
preferences | prevent running outside | all | all | too rainy | 0.35 |
preferences | prevent running outside | all | all | too dark | 0.34 |
preferences | prevent running outside | all | all | too cold | 0.31 |
preferences | prevent running outside | all | all | too hot | 0.31 |
preferences | prevent running outside | all | all | nothing | 0.07 |
preferences | group involvement | all | all | Local running club | 0.34 |
preferences | group involvement | all | all | Social groups / informal groups | 0.29 |
preferences | group involvement | all | all | running store group runs | 0.18 |
preferences | group involvement | all | all | virtual running club / challenge group | 0.11 |
preferences | group involvement | all | all | meetup running group | 0.06 |
preferences | group involvement | all | all | rrca | 0.06 |
preferences | group involvement | all | all | usa track & field | 0.04 |
preferences | group involvement | all | all | none | 0.4 |
preferences | carrying items | all | all | personal ID | 0.34 |
preferences | carrying items | all | all | fitness tracker | 0.3 |
preferences | carrying items | all | all | water bottle | 0.28 |
preferences | carrying items | all | all | energy bars/gel | 0.24 |
preferences | carrying items | all | all | portable audio system/ipod/MP3 | 0.23 |
preferences | carrying items | all | all | hydration accessories (belt, backpack) | 0.21 |
preferences | carrying items | all | all | Spibelt (or similar waist belt) | 0.21 |
preferences | carrying items | all | all | sunscreen | 0.2 |
preferences | carrying items | all | all | reflective gear | 0.19 |
preferences | carrying items | all | all | heart rate monitor | 0.18 |
preferences | carrying items | all | all | cash/credit card | 0.17 |
preferences | carrying items | all | all | chapstick/lipgloss | 0.16 |
preferences | carrying items | all | all | compression gear | 0.15 |
preferences | carrying items | all | all | sports drink | 0.09 |
preferences | carrying items | all | all | sweatband | 0.08 |
preferences | carrying items | all | all | dog | 0.07 |
preferences | carrying items | all | all | pepper spray | 0.07 |
preferences | carrying items | all | all | sleeves | 0.04 |
preferences | carrying items | all | all | toilet paper | 0.04 |
preferences | carrying items | all | all | inhaler | 0.04 |
racing | runner type | all | all | competitor | 0.56 |
racing | runner type | all | all | fun runner | 0.52 |
racing | runner type | all | all | fitness participant | 0.37 |
racing | runner type | all | all | outdoor enthusiast | 0.15 |
racing | Preferred distance | all | all | 5k | 0.14 |
racing | Preferred distance | all | all | 10k | 0.18 |
racing | Preferred distance | all | all | 12k, 15k, 10 mile | 0.06 |
racing | Preferred distance | all | all | half marathon | 0.43 |
racing | Preferred distance | all | all | marathon | 0.1 |
racing | entered last 2 years | all | all | 5k | 0.82 |
racing | entered last 2 years | all | all | half marathon | 0.8 |
racing | entered last 2 years | all | all | 10k | 0.67 |
racing | entered last 2 years | all | all | marathon | 0.43 |
racing | entered last 2 years | all | all | 12k, 15k, 10 mile | 0.37 |
racing | entered last 2 years | all | all | 4 mile, 8k, or 5 mile | 0.33 |
racing | entered last 2 years | all | all | fun run or untimed run | 0.24 |
racing | entered last 2 years | all | all | trail race | 0.23 |
racing | entered last 2 years | all | all | road running relay | 0.13 |
racing | entered last 2 years | all | all | triathlon/duathlon | 0.12 |
racing | entered last 2 years | all | all | mud/obstacle | 0.1 |
racing | entered last 2 years | all | all | color | 0.08 |
racing | entered last 2 years | all | all | 1 mile or 2 mile | 0.11 |
racing | entered last 2 years | all | all | Glow / night | 0.08 |
racing | entered last 2 years | all | all | 20k, 25k, or 30k | 0.07 |
racing | entered last 2 years | all | all | ultra distance | 0.08 |
racing | entered last 2 years | all | all | untimed walk event | 0.06 |
racing | entered last 2 years | all | all | Cross-country race | 0.05 |
racing | most interested in upcoming year | all | all | 5k | 0.56 |
racing | most interested in upcoming year | all | all | half marathon | 0.75 |
racing | most interested in upcoming year | all | all | 10k | 0.56 |
racing | most interested in upcoming year | all | all | marathon | 0.41 |
racing | most interested in upcoming year | all | all | 12k, 15k, 10 mile | 0.3 |
racing | most interested in upcoming year | all | all | 4 mile, 8k, or 5 mile | 0.21 |
racing | most interested in upcoming year | all | all | fun run or untimed run | 0.08 |
racing | most interested in upcoming year | all | all | trail race | 0.23 |
racing | most interested in upcoming year | all | all | road running relay | 0.12 |
racing | most interested in upcoming year | all | all | triathlon/duathlon | 0.13 |
racing | most interested in upcoming year | all | all | mud/obstacle | 0.1 |
racing | most interested in upcoming year | all | all | color | 0.05 |
racing | most interested in upcoming year | all | all | 1 mile or 2 mile | 0.06 |
racing | most interested in upcoming year | all | all | Glow / night | 0.07 |
racing | most interested in upcoming year | all | all | 20k, 25k, or 30k | 0.08 |
racing | most interested in upcoming year | all | all | ultra distance | 0.1 |
racing | most interested in upcoming year | all | all | untimed walk event | 0.02 |
racing | most interested in upcoming year | all | all | Cross-country race | 0.05 |
racing | holiday event | all | all | thanksgiving | 0.54 |
racing | holiday event | all | all | fourth of july | 0.26 |
racing | holiday event | all | all | st patrick’s day | 0.2 |
racing | holiday event | all | all | new years | 0.18 |
racing | holiday event | all | all | christmas | 0.15 |
racing | holiday event | all | all | halloween | 0.13 |
racing | holiday event | all | all | valentine’s day | 0.08 |
racing | participation change last year | all | all | decrease | 0.2 |
racing | participation change last year | all | all | same | 0.5 |
racing | participation change last year | all | all | increase | 0.3 |
racing | Planned participation change next year | all | all | decrease | 0.08 |
racing | Planned participation change next year | all | all | same | 0.56 |
racing | Planned participation change next year | all | all | increase | 0.36 |
racing | Half-marathons completed | all | all | 0 | 0.1 |
racing | Half-marathons completed | all | all | 1 | 0.08 |
racing | Half-marathons completed | all | all | 2 | 0.07 |
racing | Half-marathons completed | all | all | 3-5 | 0.16 |
racing | Half-marathons completed | all | all | 6-10 | 0.2 |
racing | Half-marathons completed | all | all | 11-15 | 0.13 |
racing | Half-marathons completed | all | all | 16-20 | 0.09 |
racing | Half-marathons completed | all | all | 21-25 | 0.04 |
racing | Half-marathons completed | all | all | >=25 | 0.13 |
racing | marathons completed | all | all | 0 | 0.38 |
racing | marathons completed | all | all | 1 | 0.15 |
racing | marathons completed | all | all | 2 | 0.08 |
racing | marathons completed | all | all | 3-5 | 0.14 |
racing | marathons completed | all | all | 6-10 | 0.1 |
racing | marathons completed | all | all | 11-15 | 0.05 |
racing | marathons completed | all | all | 16-20 | 0.03 |
racing | marathons completed | all | all | 21-25 | 0.01 |
racing | marathons completed | all | all | >=25 | 0.05 |
racing | Top factors effecting participation | all | all | preferred distance | 0.81 |
racing | Top factors effecting participation | all | all | date of event | 0.78 |
racing | Top factors effecting participation | all | all | location is convenient | 0.7 |
racing | Top factors effecting participation | all | all | have time to train | 0.67 |
racing | Top factors effecting participation | all | all | health/injury | 0.67 |
racing | Top factors effecting participation | all | all | sounds fun | 0.66 |
racing | Top factors effecting participation | all | all | chip timed | 0.65 |
racing | Top factors effecting participation | all | all | scenic course | 0.61 |
racing | Top factors effecting participation | all | all | cost/entry fee | 0.59 |
racing | Top factors effecting participation | all | all | reputation of event organizers | 0.58 |
racing | Other factors effecting participation | all | all | accurate, verified course | 0.52 |
racing | Other factors effecting participation | all | all | medal or other memento for finishers | 0.49 |
racing | Other factors effecting participation | all | all | quality t-shirt | 0.44 |
racing | Other factors effecting participation | all | all | fun post-race experience | 0.38 |
racing | Other factors effecting participation | all | all | it benefits an important cause | 0.35 |
racing | Other factors effecting participation | all | all | no crowds/traffic/hassles expected | 0.34 |
racing | Other factors effecting participation | all | all | my friends are doing it | 0.34 |
racing | Other factors effecting participation | all | all | promise of a unique event | 0.31 |
racing | Other factors effecting participation | all | all | It is an event I participated In before | 0.29 |
racing | Other factors effecting participation | all | all | Fast course | 0.26 |
racing | Other factors effecting participation | all | all | free race photos or videos | 0.24 |
racing | Other factors effecting participation | all | all | entertainment on course or finish | 0.24 |
racing | Other factors effecting participation | all | all | good age group awards | 0.19 |
racing | Other factors effecting participation | all | all | sustainable event/ has an environmental initiative | 0.18 |
racing | Other factors effecting participation | all | all | there Is an expo | 0.15 |
racing | Other factors effecting participation | all | all | qualifier | 0.15 |
racing | Other factors effecting participation | all | all | recycled/sustainable race t-shirt | 0.12 |
racing | Other factors effecting participation | all | all | something offered to other family members | 0.11 |
racing | Other factors effecting participation | all | all | it is a new event | 0.1 |
racing | Other factors effecting participation | all | all | has a social media app / site for sharing experience | 0.1 |
racing | Other factors effecting participation | all | all | random participant awards | 0.1 |
racing | Other factors effecting participation | all | all | appropriate training group is available | 0.06 |
racing | Other factors effecting participation | all | all | race Is part of a local grand prix | 0.05 |
racing | Other factors effecting participation | all | all | elite runners in the field | 0.05 |
racing | Other factors effecting participation | all | all | expo has guest speakers I am interested In meeting | 0.04 |
racing | Other factors effecting participation | all | all | Women only event | 0.04 |
racing | primary source of information | all | all | | 0.43 |
racing | primary source of information | all | all | word of mouth | 0.37 |
racing | primary source of information | all | all | individual race website | 0.35 |
racing | primary source of information | all | all | registration website | 0.27 |
racing | primary source of information | all | all | running store group runs | 0.23 |
racing | primary source of information | all | all | local club / city website | 0.18 |
racing | primary source of information | all | all | race calendar apps | 0.18 |
racing | primary source of information | all | all | national website | 0.15 |
racing | primary source of information | all | all | Expos | 0.11 |
racing | primary source of information | all | all | Regional / state website | 0.1 |
racing | primary source of information | all | all | national magazine | 0.09 |
racing | primary source of information | all | all | local publications | 0.09 |
racing | primary source of information | all | all | | 0.05 |
racing | primary source of information | all | all | state or regional publication | 0.04 |
racing | primary source of information | all | all | | 0.04 |
racing | primary source of information | all | all | GroupOn / Living Social / Rush 49 | 0.02 |
racing | Attitudes & behaviors | all | all | I prefer a tech finisher t-shirt to a cotton finisher t-shirt | 0.73 |
racing | Attitudes & behaviors | all | all | It is easy to find an event I want to participate in | 0.73 |
racing | Attitudes & behaviors | all | all | I would participate in more events if entry fees were lower | 0.62 |
racing | Attitudes & behaviors | all | all | I prefer traditional to non-traditional (i.e. mud, obstacle, color) running events | 0.58 |
racing | Attitudes & behaviors | all | all | Race fees are too expensive | 0.56 |
racing | Attitudes & behaviors | all | all | I receive good value for my race fee | 0.51 |
racing | Attitudes & behaviors | all | all | I like participating in the same events every year | 0.49 |
racing | Attitudes & behaviors | all | all | I am always looking for a new event experience | 0.48 |
racing | Attitudes & behaviors | all | all | I like to share my race experience with others via social media | 0.45 |
racing | Attitudes & behaviors | all | all | I wish races offered something other than a finisher t-shirt for SWAG | 0.4 |
racing | Attitudes & behaviors | all | all | you should get a race finisher t-shirt for finishing shorter distances like a 5k or 10k | 0.33 |
racing | Attitudes & behaviors | all | all | Social media is my first choice for event information | 0.28 |
racing | Attitudes & behaviors | all | all | You should only get a race finisher medal for a half-marathon or marathon | 0.27 |
racing | Attitudes & behaviors | all | all | I prefer larger races to smaller races | 0.26 |
racing | Attitudes & behaviors | all | all | Race medals are getting too big | 0.23 |
racing | Attitudes & behaviors | all | all | I like to take pictures while I am participating in an event | 0.2 |
racing | Attitudes & behaviors | all | all | There are too many events to choose from | 0.16 |
racing | Attitudes & behaviors | all | all | I would pay more for a VIP race experience (race day packet pickup, access to special porto potties, front on the starting line access, etc) | 0.15 |
racing | Attitudes & behaviors | all | all | I don’t care about my race time | 0.12 |
racing | Attitudes & behaviors | all | all | I would prefer to participate in events as a group vs individually | 0.11 |
racing | Attitudes & behaviors | all | all | I like events where I can dress up in a costume / there is a theme | 0.09 |
racing | Attitudes & behaviors | all | all | I prefer untimed events to timed events | 0.03 |
social media | follows | all | all | Running stores | 0.4 |
social media | follows | all | all | Running brands | 0.37 |
social media | follows | all | all | Local runners | 0.32 |
social media | follows | all | all | Elite runners | 0.29 |
social media | follows | all | all | Bloggers | 0.19 |
social media | follows | all | all | Celebrities | 0.05 |
social media | follows | all | all | None of these | 0.16 |
social media | follows | all | all | I don’t follow any running-related accounts | 0.21 |
technology | app usage | all | all | I like to have / track all of my running statistics | 0.82 |
technology | app usage | all | all | It helps me train better | 0.65 |
technology | app usage | all | all | It makes me feel good to see what I ran | 0.61 |
technology | app usage | all | all | I like to share the information with others on social media | 0.17 |
technology | app usage | all | all | I like to see how I compare to other runners | 0.15 |
technology | app usage | all | all | I like to share the information with a running coach | 0.07 |
technology | app usage | all | all | Everyone else is using them so I do too | 0.02 |
technology | tracking | device | Phone / app on phone | track mileage | 0.45 |
technology | tracking | device | Phone / app on phone | GPS | 0.36 |
technology | tracking | device | Phone / app on phone | track nutrition / calories | 0.29 |
technology | tracking | device | Phone / app on phone | track steps | 0.21 |
technology | tracking | device | Phone / app on phone | map routes | 0.38 |
technology | tracking | device | Phone / app on phone | play music | 0.55 |
technology | tracking | device | Phone / app on phone | training programs | 0.22 |
technology | tracking | device | Phone / app on phone | tracking workouts | 0.4 |
technology | tracking | device | Phone / app on phone | interval training | 0.17 |
technology | tracking | device | Phone / app on phone | virtual coach | 0.11 |
technology | tracking | device | Phone / app on phone | none of these | 0.08 |
technology | tracking | device | watch | track mileage | 0.48 |
technology | tracking | device | watch | GPS | 0.52 |
technology | tracking | device | watch | track nutrition / calories | 0.07 |
technology | tracking | device | watch | track steps | 0.28 |
technology | tracking | device | watch | map routes | 0.17 |
technology | tracking | device | watch | play music | 0.04 |
technology | tracking | device | watch | training programs | 0.08 |
technology | tracking | device | watch | tracking workouts | 0.34 |
technology | tracking | device | watch | interval training | 0.27 |
technology | tracking | device | watch | virtual coach | 0.04 |
technology | tracking | device | watch | none of these | 0.08 |
technology | tracking | device | wearable tracking device | track mileage | 0.23 |
technology | tracking | device | wearable tracking device | GPS | 0.16 |
technology | tracking | device | wearable tracking device | track nutrition / calories | 0.05 |
technology | tracking | device | wearable tracking device | track steps | 0.26 |
technology | tracking | device | wearable tracking device | map routes | 0.08 |
technology | tracking | device | wearable tracking device | play music | 0.04 |
technology | tracking | device | wearable tracking device | training programs | 0.04 |
technology | tracking | device | wearable tracking device | tracking workouts | 0.17 |
technology | tracking | device | wearable tracking device | interval training | 0.09 |
technology | tracking | device | wearable tracking device | virtual coach | 0.02 |
technology | tracking | device | wearable tracking device | none of these | 0.11 |
technology | tracking | device | online website | track mileage | 0.12 |
technology | tracking | device | online website | GPS | 0.04 |
technology | tracking | device | online website | track nutrition / calories | 0.06 |
technology | tracking | device | online website | track steps | 0.02 |
technology | tracking | device | online website | map routes | 0.24 |
technology | tracking | device | online website | play music | 0.01 |
technology | tracking | device | online website | training programs | 0.18 |
technology | tracking | device | online website | tracking workouts | 0.12 |
technology | tracking | device | online website | interval training | 0.04 |
technology | tracking | device | online website | virtual coach | 0.04 |
technology | tracking | device | online website | none of these | 0.09 |
social media | activity | platform | all | fundraise for a charity event | 0.62 |
social media | activity | platform | all | discuss running-related activities | 0.62 |
social media | activity | platform | all | look for running motivation | 0.49 |
social media | activity | platform | all | recruit others to join me at an upcoming race | 0.49 |
social media | activity | platform | all | share your current training | 0.47 |
social media | activity | platform | all | communicate with training partners | 0.43 |
social media | activity | platform | all | recruit others to train with you | 0.42 |
social media | activity | platform | all | post your race results | 0.42 |
social media | activity | platform | all | follow events | 0.4 |
social media | activity | platform | all | post general running photos and videos | 0.38 |
social media | activity | platform | all | look for running training advice | 0.37 |
social media | activity | platform | all | track friends / family in a race | 0.33 |
social media | activity | platform | all | follow other runners (non-professional) | 0.31 |
social media | activity | platform | all | Follow professional runners | 0.29 |
social media | activity | platform | all | Post your mileage / running routes | 0.29 |
social media | activity | platform | all | Post race photos and videos | 0.26 |
social media | activity | platform | all | None of these | 0.15 |
social media | activity | platform | fundraise for a charity event | 0.07 | |
social media | activity | platform | discuss running-related activities | 0.11 | |
social media | activity | platform | look for running motivation | 0.06 | |
social media | activity | platform | recruit others to join me at an upcoming race | 0.06 | |
social media | activity | platform | share your current training | 0.06 | |
social media | activity | platform | communicate with training partners | 0.08 | |
social media | activity | platform | recruit others to train with you | 0.04 | |
social media | activity | platform | post your race results | 0.09 | |
social media | activity | platform | follow events | 0.05 | |
social media | activity | platform | post general running photos and videos | 0.06 | |
social media | activity | platform | look for running training advice | 0.03 | |
social media | activity | platform | track friends / family in a race | 0.05 | |
social media | activity | platform | follow other runners (non-professional) | 0.02 | |
social media | activity | platform | Follow professional runners | 0.04 | |
social media | activity | platform | Post your mileage / running routes | 0.04 | |
social media | activity | platform | Post race photos and videos | 0.11 | |
social media | activity | platform | None of these | 0.22 | |
social media | activity | platform | fundraise for a charity event | 0.26 | |
social media | activity | platform | discuss running-related activities | 0.16 | |
social media | activity | platform | look for running motivation | 0.21 | |
social media | activity | platform | recruit others to join me at an upcoming race | 0.16 | |
social media | activity | platform | share your current training | 0.1 | |
social media | activity | platform | communicate with training partners | 0.2 | |
social media | activity | platform | recruit others to train with you | 0.07 | |
social media | activity | platform | post your race results | 0.2 | |
social media | activity | platform | follow events | 0.07 | |
social media | activity | platform | post general running photos and videos | 0.09 | |
social media | activity | platform | look for running training advice | 0.06 | |
social media | activity | platform | track friends / family in a race | 0.12 | |
social media | activity | platform | follow other runners (non-professional) | 0.05 | |
social media | activity | platform | Follow professional runners | 0.1 | |
social media | activity | platform | Post your mileage / running routes | 0.06 | |
social media | activity | platform | Post race photos and videos | 0.18 | |
social media | activity | platform | None of these | 0.2 | |
social media | activity | platform | fundraise for a charity event | 0.01 | |
social media | activity | platform | discuss running-related activities | 0.01 | |
social media | activity | platform | look for running motivation | 0.01 | |
social media | activity | platform | recruit others to join me at an upcoming race | 0 | |
social media | activity | platform | share your current training | 0.01 | |
social media | activity | platform | communicate with training partners | 0.13 | |
social media | activity | platform | recruit others to train with you | 0 | |
social media | activity | platform | post your race results | 0.01 | |
social media | activity | platform | follow events | 0.01 | |
social media | activity | platform | post general running photos and videos | 0.08 | |
social media | activity | platform | look for running training advice | 0 | |
social media | activity | platform | track friends / family in a race | 0 | |
social media | activity | platform | follow other runners (non-professional) | 0 | |
social media | activity | platform | Follow professional runners | 0 | |
social media | activity | platform | Post your mileage / running routes | 0 | |
social media | activity | platform | Post race photos and videos | 0.01 | |
social media | activity | platform | None of these | 0.24 | |
social media | activity | platform | fundraise for a charity event | 0.07 | |
social media | activity | platform | discuss running-related activities | 0.07 | |
social media | activity | platform | look for running motivation | 0.09 | |
social media | activity | platform | recruit others to join me at an upcoming race | 0.11 | |
social media | activity | platform | share your current training | 0.11 | |
social media | activity | platform | communicate with training partners | 0.1 | |
social media | activity | platform | recruit others to train with you | 0.13 | |
social media | activity | platform | post your race results | 0.11 | |
social media | activity | platform | follow events | 0.14 | |
social media | activity | platform | post general running photos and videos | 0.13 | |
social media | activity | platform | look for running training advice | 0.15 | |
social media | activity | platform | track friends / family in a race | 0.15 | |
social media | activity | platform | follow other runners (non-professional) | 0.17 | |
social media | activity | platform | Follow professional runners | 0.16 | |
social media | activity | platform | Post your mileage / running routes | 0.17 | |
social media | activity | platform | Post race photos and videos | 0.15 | |
social media | activity | platform | None of these | 0.12 | |
technology | tracking | all | all | App on my phone | 0.63 |
technology | tracking | all | all | hard copy log or journal | 0.23 |
technology | tracking | all | all | computer software | 0.15 |
technology | tracking | all | all | online software | 0.14 |
technology | tracking | all | all | None, do not track at all | 0.09 |
preferences | general | all | all | Run in a lower-cost, no frills, less swag race | 0.55 |
preferences | general | all | all | Run in a higher-cost, fuller experience, more swag race | 0.45 |
preferences | general | all | all | Run outside during harsher weather | 0.57 |
preferences | general | all | all | Run inside during harsher weather | 0.43 |
preferences | general | all | all | Only run as exercise | 0.23 |
preferences | general | all | all | Supplement running with other exercise | 0.77 |
preferences | general | all | all | Stretch | 0.71 |
preferences | general | all | all | Not stretch | 0.29 |
preferences | general | all | all | Run on trails | 0.37 |
preferences | general | all | all | Run on the roads | 0.64 |
preferences | general | all | all | Run with a watch or tracking device | 0.93 |
preferences | general | all | all | Run without a watch or tracking device | 0.07 |
preferences | general | all | all | Run with music | 0.6 |
preferences | general | all | all | Run without music | 0.4 |
preferences | general | all | all | Share your running with others on social media | 0.35 |
preferences | general | all | all | Keep my running to myself and close friends | 0.65 |
learning | general | all | all | Best places to run when on vacation | 0.49 |
learning | general | all | all | Best places to run in your area | 0.47 |
learning | general | all | all | How to avoid injuries | 0.46 |
learning | general | all | all | How to cross train to supplement your running | 0.42 |
learning | general | all | all | Easy ways to find races to participate in | 0.38 |
learning | general | all | all | What to eat before a big race | 0.35 |
learning | general | all | all | How to select the best running shoes | 0.34 |
learning | general | all | all | What pace you should run when racing | 0.31 |
learning | general | all | all | How to be safe when running | 0.26 |
learning | general | all | all | How to run in inclement weather | 0.2 |
learning | general | all | all | How to find a good running coach | 0.14 |
learning | general | all | all | How to run at night | 0.12 |
injuries | Past 12 months | all | all | blisters | 0.29 |
injuries | Past 12 months | all | all | knee | 0.22 |
injuries | Past 12 months | all | all | hips | 0.14 |
injuries | Past 12 months | all | all | plantar fasciitis | 0.14 |
injuries | Past 12 months | all | all | foot | 0.13 |
injuries | Past 12 months | all | all | IT Band Syndrome | 0.12 |
injuries | Past 12 months | all | all | Lower back | 0.11 |
injuries | Past 12 months | all | all | shin splints | 0.11 |
injuries | Past 12 months | all | all | hamstring | 0.1 |
injuries | Past 12 months | all | all | calf | 0.09 |
injuries | Past 12 months | all | all | ankle | 0.08 |
injuries | Past 12 months | all | all | achilles tendon | 0.07 |
injuries | Past 12 months | all | all | stress fracture | 0.03 |
injuries | Past 12 months | all | all | quadriceps | 0.02 |
injuries | Past 12 months | all | all | None of these | 0.23 |
injuries | how to deal with | all | all | take time off | 0.66 |
injuries | how to deal with | all | all | take anti-inflammatory | 0.58 |
injuries | how to deal with | all | all | stretch | 0.57 |
injuries | how to deal with | all | all | ice | 0.53 |
injuries | how to deal with | all | all | cross train | 0.32 |
injuries | how to deal with | all | all | run through it | 0.32 |
injuries | how to deal with | all | all | seek advice from other runners | 0.29 |
injuries | how to deal with | all | all | seek advice online | 0.27 |
injuries | how to deal with | all | all | see doctor | 0.27 |
As you can see, Gender can be estimated by tracking the gender category of the races that the visitor views. For example, if the visitor only views women’s races, we can assume that the user has a strong affinity for women’s racing (through friendship, family, coaching, or their own participation). According to the Running USA 2012 Survey, 63% of runners in the adult running community are female. Combining this with the user’s trend of only looking at women’s races, we can estimate that the person is most likely a woman, or has a strong affinity for women’s running.
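As a rough sketch of how this estimate could be computed, the snippet below starts from the survey's 63% figure as a prior and updates it with Bayes' theorem each time the visitor views a race. The function name and the two likelihood values are hypothetical placeholders (the survey does not provide them), and treating each page view as independent is a simplifying assumption.

```python
def update_p_female(prior, views, p_view_given_female=0.7, p_view_given_male=0.3):
    """Return P(female | views) after a sequence of page views.

    views: iterable of booleans, True when the viewed race is a women's race.
    The two likelihoods are hypothetical placeholders, not survey figures.
    """
    p_female = prior
    for is_womens_race in views:
        # Likelihood of this observation under each hypothesis.
        like_f = p_view_given_female if is_womens_race else 1 - p_view_given_female
        like_m = p_view_given_male if is_womens_race else 1 - p_view_given_male
        # Bayes' theorem, with the denominator expanded over both hypotheses.
        numerator = like_f * p_female
        p_female = numerator / (numerator + like_m * (1 - p_female))
    return p_female


PRIOR_FEMALE = 0.63  # Running USA 2012 Survey: share of adult runners who are female

print(update_p_female(PRIOR_FEMALE, [True]))
print(update_p_female(PRIOR_FEMALE, [True, True, True]))
```

With these placeholder likelihoods, each additional women's race viewed pushes the estimate further above the 63% prior. Real likelihoods would have to be measured from visitors whose gender is already known.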
Runner Type can be estimated by looking at the types of race results that a visitor tends to view. Do they prefer looking at championship races, or do they prefer looking at Tough Mudder results? The Running USA survey divides runners into 4 groups:
- Competitors - The people who enjoy running solely for the competition. These are the people who want to get a personal record, beat their friends, win their age group or win the whole race.
- Fun Runners - People who enjoy running, and do it purely for the enjoyment of the activity and the events
- Fitness Participants - Runners who participate in races to stay fit and keep a healthy lifestyle
- Outdoor Enthusiasts - People who enjoy being outdoors and do many outdoor activities, not just running
These runners can be identified by the types of runs that they participate in and, more importantly for us, the race results that they view. If a visitor only tends to look at championship events or college meets, or only views the results at the top of the page (indicating they only looked at the winners), it is likely that this person has an affinity for competitive racing. On the other hand, if the visitor only tends to look at less competition-focused events, like Color Me Rad or Tough Mudder races, it is possible they are more interested in less competitive, fun races. According to the Running USA Survey, runners most often identify as competitive runners (56%) and as running for fun (52%); the groups overlap, so the percentages do not sum to 100%. The fewest runners, 15%, identify as outdoor enthusiasts. So, if a person only tends to look at the NCAA championships, World Championships, and Rock N Roll Half Marathon results, it is likely this person strongly prefers competitive racing.
Preferred Event Types While the Running USA Survey has limited coverage of this dimension, it is very simple to identify without using other data sets. A visitor who only looks at a specific type of event can be identified as an enthusiast for that event. For example, if a visitor only tends to look at sprinting events (races <= 400m), it is not likely that they will click on an ad for the Rock N Roll Half Marathon.
Name | Events |
---|---|
Track & Field - Sprints | Distance <=400m |
Track & Field - Mid-Distance | 400m < distance <= 1600m |
Distance | 1600m < distance <= 5k |
Field Events | Pole Vault, Hammer Throw, etc |
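A minimal sketch of how the table could be applied to a visitor's browsing history is shown below. The function names and the sample history are hypothetical. Field events are left out because they have no race distance, and treating every race longer than 1600m as "Distance" is a simplification of the table.

```python
from collections import Counter


def event_category(distance_m):
    """Map a race distance in meters to one of the running categories in the table."""
    if distance_m <= 400:
        return "Track & Field - Sprints"
    if distance_m <= 1600:
        return "Track & Field - Mid-Distance"
    return "Distance"  # assumption: no upper bound is applied here


def category_shares(viewed_distances_m):
    """Return the fraction of viewed races that fall in each category."""
    counts = Counter(event_category(d) for d in viewed_distances_m)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


# Hypothetical browsing history: distances (in meters) of the races a visitor viewed.
shares = category_shares([100, 200, 400, 200, 5000])
print(max(shares, key=shares.get), shares)
```

A visitor whose views are dominated by one category can then be tagged as an enthusiast for that category, with the share acting as a rough confidence score.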
Team Affiliations/Interests Identifying which teams a user may be interested in allows developers to recommend race results based on the teams participating. While this approach also does not involve external datasets, it is incredibly simple to implement and (if done responsibly) can yield very useful information for visitors. This information can be found by collecting the teams that participate in the races viewed by a visitor. If a user comes to the site and only looks at results for the Music City Challenge, the Tiger Paw Invitational, Penn Relays, and the ACC Outdoor Championships, it is possible the visitor was looking at the Virginia Tech Hokies, as they attended 3 of those 4 college track meets. We can also learn that they tend to enjoy competitive racing. This technique can also be used in non-team races to narrow down who a visitor may be a fan of (or who they may be), although, depending on the size of the races, it may require many more race examples to adequately narrow down the potential list of participants.
However, it is very unlikely that all of the races a visitor views will have their team in them, so we may want to use a softer, more forgiving approach by introducing an optional threshold argument.
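One way such a threshold could work is sketched below: a team is kept if it appears in at least a given fraction of the races the visitor viewed. The function name `likely_teams`, its signature, and the rosters are all hypothetical, with placeholder team names rather than real meet entries.

```python
from collections import Counter


def likely_teams(viewed_races, threshold=1.0):
    """Return teams that appear in at least `threshold` of the viewed races.

    viewed_races: list of sets, each holding the teams entered in one viewed race.
    threshold=1.0 is the strict version (the team must be in every viewed race).
    """
    if not viewed_races:
        return {}
    counts = Counter(team for teams in viewed_races for team in teams)
    total = len(viewed_races)
    return {
        team: n / total
        for team, n in counts.items()
        if n / total >= threshold
    }


# Hypothetical rosters for four viewed meets (placeholder team names).
viewed = [
    {"team_x", "team_y"},
    {"team_x", "team_z"},
    {"team_x", "team_y", "team_z"},
    {"team_w"},
]

print(likely_teams(viewed))                  # strict: no team is in all four meets
print(likely_teams(viewed, threshold=0.75))  # keeps team_x, present in 3 of 4 meets
```

With the strict default, a team that missed a single meet is discarded. Lowering the threshold to 0.75 recovers the 3-out-of-4 case described above.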
Narrowing Down Attributes: Injuries🔗
The previous scenario allows us to predict a set of attributes about any visitor where every page allows us to learn about each attribute. But many scenarios do not provide information on all potential attributes simultaneously; we may only learn about a subset of attributes at a time. In fact, in many cases we can’t even predict the order in which we will learn information about an attribute. These uncertainties can be handled using conditional probabilities. A good example of this is information regarding athletic injuries. Obtaining accurate information regarding injuries can be difficult, as some athletes cannot provide all the details of their injury. Fortunately, we can use existing datasets to narrow down injury attributes. We will be using data from publications from the Datalys Center.
Let’s say that a female cross country athlete provides us with pieces of information regarding an injury she received but is having trouble remembering the details, so we don’t know which details she will remember, or when (or if) she will remember them. As she is a collegiate cross country runner, we can use the publication Epidemiology of National Collegiate Athletic Association Men’s and Women’s Cross-Country Injuries, 2009–2010 Through 2013–2014 from the Journal of Athletic Training.
The first detail is that she injured her lower leg. We can cross-reference this information with the data extracted from the Datalys Journal:
gender | part | fraction of sample | rate per 1000 athlete exposures | time loss (fraction) | severe injuries (fraction) |
---|---|---|---|---|---|
female | head/face | 0.008 | 0.04 | 0 | 0 |
female | neck | 0.008 | 0.04 | 0 | 0 |
female | shoulder/clavicle | 0 | 0 | 0 | 0 |
female | arm/elbow | 0.004 | 0.02 | 0 | 0 |
female | hand/wrist | 0.004 | 0.02 | 0 | 0 |
female | trunk | 0.085 | 0.49 | 0.636 | 0.227 |
female | hip/groin | 0.112 | 0.65 | 0.586 | 0.138 |
female | thigh | 0.146 | 0.85 | 0.605 | 0.184 |
female | knee | 0.123 | 0.72 | 0.688 | 0.031 |
female | lower leg | 0.235 | 1.37 | 0.574 | 0.131 |
female | ankle | 0.05 | 0.29 | 0.615 | 0 |
female | foot | 0.154 | 0.9 | 0.525 | 0.225 |
female | other | 0.073 | 0.43 | 0.579 | 0 |
female | total | 1 | 5.83 | 0.588 | 0.131 |
male | head/face | 0.009 | 0.04 | 0 | 0 |
male | neck | 0 | 0 | 0 | 0 |
male | shoulder/clavicle | 0 | 0 | 0 | 0 |
male | arm/elbow | 0 | 0 | 0 | 0 |
male | hand/wrist | 0 | 0 | 0 | 0 |
male | trunk | 0.056 | 0.26 | 0.667 | 0.167 |
male | hip/groin | 0.032 | 0.15 | 0.571 | 0.143 |
male | thigh | 0.125 | 0.58 | 0.593 | 0.037 |
male | knee | 0.107 | 0.5 | 0.609 | 0.087 |
male | lower leg | 0.352 | 1.64 | 0.605 | 0.132 |
male | ankle | 0.13 | 0.6 | 0.464 | 0 |
male | foot | 0.157 | 0.73 | 0.765 | 0.059 |
male | other | 0.032 | 0.15 | 0.429 | 0 |
male | total | 1 | 4.66 | 0.611 | 0.083 |
According to the study, 23.5% of injuries to female collegiate cross country athletes are lower leg injuries. This can be written as P(Lower Leg | Female and College XC) = 0.235, meaning the probability of a lower leg injury given that the athlete is female and a collegiate cross country runner. According to the same study, 57% of lower leg injuries to female athletes resulted in time loss (the athlete was not able to compete for some length of time), but only 13% of lower leg injuries were severe.
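The figures in the previous paragraph come straight from the female lower leg row, and they can be combined with the chain rule. The short sketch below is illustrative only: the dictionary keys are hypothetical names, and it reads the time loss and severe columns as conditional on the body part, as described above.

```python
# Row for female lower leg injuries, copied from the table above.
lower_leg = {"share_of_injuries": 0.235, "time_loss": 0.574, "severe": 0.131}

# P(lower leg | female, college XC)
p_lower_leg = lower_leg["share_of_injuries"]

# Chain rule: P(lower leg and time loss) = P(time loss | lower leg) * P(lower leg)
p_lower_leg_and_time_loss = lower_leg["time_loss"] * p_lower_leg
p_lower_leg_and_severe = lower_leg["severe"] * p_lower_leg

print(round(p_lower_leg_and_time_loss, 3))  # 0.135
print(round(p_lower_leg_and_severe, 3))     # 0.031
```

In other words, roughly 13.5% of all recorded injuries to female collegiate cross country runners are lower leg injuries that cause time loss, and about 3% are severe lower leg injuries.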
To take this a step further, of the 3 types of recorded lower leg injuries among female cross country runners, inflammation is the most typical at 35% of those injuries. The other injury types are only slightly less likely, so no single injury type stands out as significantly more likely.
gender | part | injury | fraction of sample | rate per 1000 athlete exposures | time loss (fraction) | severe injuries (fraction) |
---|---|---|---|---|---|---|
female | thigh | strain | 0.065 | 0.38 | 0.588 | 0.059 |
female | lower leg | inflammation | 0.065 | 0.38 | 0.706 | 0.059 |
female | lower leg | tendinitis | 0.062 | 0.36 | 0.688 | 0.063 |
female | lower leg | strain | 0.058 | 0.34 | 0.467 | 0.067 |
female | foot | inflammation | 0.058 | 0.34 | 0.667 | 0.2 |
female | thigh | inflammation | 0.05 | 0.29 | 0.769 | 0.231 |
female | knee | inflammation | 0.05 | 0.29 | 0.692 | 0.077 |
female | ankle | sprain | 0.042 | 0.25 | 0.545 | 0 |
female | hip/groin | strain | 0.039 | 0.22 | 0.6 | 0.1 |
female | respiratory | respiratory | 0.035 | 0.2 | 0.444 | 0 |
male | ankle | sprain | 0.111 | 0.52 | 0.458 | 0 |
male | lower leg | tendinitis | 0.097 | 0.45 | 0.571 | 0.048 |
male | lower leg | inflammation | 0.083 | 0.39 | 0.889 | 0 |
male | thigh | strain | 0.069 | 0.32 | 0.667 | 0 |
male | lower leg | strain | 0.069 | 0.32 | 0.6 | 0.133 |
male | foot | inflammation | 0.046 | 0.22 | 1 | 0 |
male | knee | inflammation | 0.032 | 0.15 | 0.714 | 0 |
male | hip/groin | strain | 0.028 | 0.13 | 0.667 | 0.167 |
male | knee | tendinitis | 0.028 | 0.13 | 0.5 | 0.167 |
male | lower leg | stress fracture | 0.023 | 0.11 | 0 | 0.4 |
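The 35% figure for inflammation can be reproduced by renormalizing the three female lower leg rows so that they sum to 1. This is a small sketch with hypothetical variable names. Note that it only covers the injury types listed in this table, so it is conditional on the injury being one of those three.

```python
# Female lower leg rows from the table above: injury type -> fraction of all injuries.
lower_leg_types = {"inflammation": 0.065, "tendinitis": 0.062, "strain": 0.058}

total = sum(lower_leg_types.values())
# Renormalize so the three listed types sum to 1.
conditional = {name: share / total for name, share in lower_leg_types.items()}

for name, p in conditional.items():
    print(f"{name}: {p:.2f}")
```

This gives roughly 35% for inflammation, 34% for tendinitis, and 31% for strain, which is why no single type stands out.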
At this point we have the following information:
- A female collegiate athlete injured her lower leg, which is a fairly common injury
- She will likely need to take (or has taken) time off to recover
- There is only a 13% chance that it is a severe injury
So we already know a fair amount. She later recalls that she thinks she got injured during a race. Combining this with high-level data on the participation rate of female cross-country athletes across the NCAA divisions, we can use Bayes’ theorem to determine which division she is most likely in.
category | division | gender | name | rate per 1000 athlete exposures |
---|---|---|---|---|
injury rates | 1 | female | practice | 6.52 |
injury rates | 1 | male | practice | 5.7 |
injury rates | 1 | female | competition | 9.5 |
injury rates | 1 | male | competition | 3.41 |
injury rates | 1 | female | total | 6.75 |
injury rates | 1 | male | total | 5.53 |
injury rates | 2 | female | practice | 1.96 |
injury rates | 2 | male | practice | 1.62 |
injury rates | 2 | female | competition | 6.41 |
injury rates | 2 | male | competition | 2.82 |
injury rates | 2 | female | total | 2.28 |
injury rates | 2 | male | total | 1.71 |
injury rates | 3 | female | practice | 6.13 |
injury rates | 3 | male | practice | 5.14 |
injury rates | 3 | female | competition | 5.77 |
injury rates | 3 | male | competition | 5.77 |
injury rates | 3 | female | total | 6.09 |
injury rates | 3 | male | total | 5.21 |
injury rates | overall | female | practice | 5.69 |
injury rates | overall | male | practice | 4.7 |
injury rates | overall | female | competition | 7.46 |
injury rates | overall | male | competition | 4.22 |
injury rates | overall | female | total | 5.85 |
injury rates | overall | male | total | 4.66 |
Bayes’ theorem allows us to find probabilities regardless of the order in which we receive information. To get a probability for a value, we just need the probabilities of a and of b, and the conditional_probability of b given a, to find the conditional probability of a given b: P(a | b) = P(b | a) × P(a) / P(b).
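As a minimal sketch, Bayes' theorem can be written as a single function. The function and parameter names here are hypothetical, and the numbers in the example call are made up purely to exercise it.

```python
def bayes(p_a, p_b, p_b_given_a):
    """Return P(a | b) = P(b | a) * P(a) / P(b)."""
    if p_b <= 0:
        raise ValueError("P(b) must be positive")
    return p_b_given_a * p_a / p_b


# Made-up numbers purely to exercise the function.
print(round(bayes(p_a=0.3, p_b=0.5, p_b_given_a=0.8), 2))  # 0.48
```

Because the formula only needs these three quantities, it does not matter whether a or b was observed first.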
Based on Bayes’ theorem, given that she was injured during competition, it is most likely that she competes for a Division 1 school or a Division 3 school (77%).
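The sketch below shows the shape of that calculation across the three divisions, using the female competition injury rates from the table as likelihoods. The participation shares used for the figure above are not reproduced in this post, so the prior here is a uniform placeholder and the output will not match that figure. The function name `posterior` is also hypothetical.

```python
# Female competition injury rates per 1000 athlete exposures, from the table above.
# Only their relative sizes matter, because the result is normalized.
competition_rate = {"Division 1": 9.5, "Division 2": 6.41, "Division 3": 5.77}

# Placeholder prior: equal weight per division. Replace with real participation shares.
prior = {division: 1 / 3 for division in competition_rate}


def posterior(prior, likelihood):
    """Return P(hypothesis | evidence) for each hypothesis via Bayes' theorem."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(evidence), by total probability
    return {h: value / evidence for h, value in unnormalized.items()}


p_division = posterior(prior, competition_rate)
for division, p in p_division.items():
    print(f"{division}: {p:.3f}")
```

Swapping the placeholder prior for each division's actual share of female cross country participation would weight the result toward the divisions with more athletes.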
To be clear: I would never recommend using this as an actual method for making medical decisions. Please leave that to the certified professionals. I chose an example involving athletics because I enjoy the sport, not in order to advocate for using it. While this method of inference can be applied to just about any field, there are some (like individual health-care) where some restraint should be taken before applying this method. The above example is simply a demonstration of how these aggregate statistics can be used to infer knowledge in cases where uncertainty is very high.
Finding and selecting datasets🔗
Identifying reputable sources for data sets can be crucial for applying these techniques for narrowing down information.
- Pew Research
- Gallup
- World Bank
- Centers for Disease Control and Prevention (CDC)
- Food and Agriculture Organization of the United Nations (FAO)
- World Health Organization (WHO)
- Google BigQuery for U.S. Census data
- NCAA Statistics
- Datalys
The keys to a good dataset are:
- Large datasets
- Unbiased data
- Population of data is directly relevant to what you are trying to predict
When you are looking for datasets, be sure that the dataset is similar to the data you will be applying the resulting statistics to. For example, if you are collecting demographic data to estimate the demographics of new users in Virginia, using demographic statistics from Virginia in the 1970s or from the state of North Carolina will most likely not provide very accurate estimates of Virginia users in the present day. Due to changes in the job market, ease of travel, and immigration trends, Virginia does not have the same demographics as it did in the 1970s, so the predictions would reflect the demographics of the 1970s, not those of the present day.
Conclusion🔗
Using Bayes’ theorem for predictive purposes can be useful, especially when only aggregate statistics are available. The areas of application for this technique are wide, but I would not recommend using it for making decisions where a wrong decision can have life-altering consequences (e.g. cancer detection). This technique is for use in cases where a wrong decision can be recovered from.
I have applied this technique to internal business processes for my work, predicting characteristics of individuals based on demographic information, and for analyzing extensive running data (as demonstrated above). I hope you find this technique as useful and as simple as I did.