Greg Baker
UP Cebu, February 2018
These slides: https://bit.ly/cebu-2018-slides
Senior Lecturer at Simon Fraser University in Vancouver, Canada.
Accidental data scientist.
Teach a 3rd year undergraduate Data Science course, and MSc-level Big Data lab course.
According to Wikipedia: an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms…
According to Pat Hanrahan, Tableau Software: [The combination of] business knowledge, analytical skills, and computer science.
According to Daniel Tunkelang, LinkedIn: [The ability to] obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning.
According to Joel Grus: There's a joke that says a data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician.… We'll say that a data scientist is someone who extracts insights from messy data.
According to Drew Conway, Alluvium:
My working definition:
You get some data. Then what do you do to get answers from it? Whatever that is, that's data science.
Of course, one of those things is statistics.
Why is "data science" suddenly so popular?
There's more data being collected: web access logs, purchase history, click-through rates, location history, sensor data, ….
Sometimes the volume of data is big: too big to manage easily, or with a single computer. That's where "big data" usually starts.
People want answers/​insights from that data: Is the marketing campaign working? Is the UI actually usable? What if we did X instead of Y?
New techniques: Machine learning lets us attack questions that were previously unanswerable. Computer scientists are realizing that statistics is important; statisticians are realizing that computer science is important.
If this is like every other stats course in the world, data usually looks like this:
Daily Users | Daily Page Views |
---|---|
87 | 348 |
31 | 219 |
104 | 401 |
83 | 302 |
… and you're asked if users and page views have a linear relation.
But data is rarely that beautiful when you find it. That data probably started as thousands of lines like:
www-d3.proxy.aol.com - - [01/Aug/1995:00:01:52 -0400] "GET /shuttle/missions/sts-71/mission-sts-71.html HTTP/1.0" 200 13450
in24.inetnebr.com - - [01/Aug/1995:00:01:54 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-09.txt HTTP/1.0" 200 2166
205.163.36.61 - - [01/Aug/1995:00:01:55 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 4324
205.163.36.61 - - [01/Aug/1995:00:01:57 -0400] "GET /shuttle/countdown/count70.gif HTTP/1.0" 304 0
rpgopher.aist.go.jp - - [01/Aug/1995:00:01:58 -0400] "GET /ksc.html HTTP/1.0" 200 7280
… and you have to figure out which line is from which day, which request comes from each user, etc. Then you can do some stats.
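A minimal sketch of that parsing step, assuming log lines in the format shown above. The function names are made up for illustration, and two simplifications are assumptions rather than anything from the original slides: a "user" is a distinct host (proxies make that only an approximation) and a "page view" is a successful request for an .html file.

```python
import re
from collections import defaultdict
from datetime import datetime

# Fields in the log format above: host, two unused fields, [timestamp], "request", status, bytes.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)$'
)


def parse_line(line):
    """Return (host, date, path, status) for one log line, or None if it doesn't match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    parts = m.group("request").split()
    path = parts[1] if len(parts) >= 2 else ""
    return m.group("host"), when.date(), path, int(m.group("status"))


def daily_summary(lines):
    """Map each day to (distinct hosts, successful .html requests)."""
    hosts = defaultdict(set)
    views = defaultdict(int)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # assumption: silently skip lines that don't parse
        host, day, path, status = parsed
        hosts[day].add(host)
        if status == 200 and path.endswith(".html"):
            views[day] += 1
    return {day: (len(hosts[day]), views[day]) for day in hosts}


sample = [
    'www-d3.proxy.aol.com - - [01/Aug/1995:00:01:52 -0400] "GET /shuttle/missions/sts-71/mission-sts-71.html HTTP/1.0" 200 13450',
    'in24.inetnebr.com - - [01/Aug/1995:00:01:54 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-09.txt HTTP/1.0" 200 2166',
    '205.163.36.61 - - [01/Aug/1995:00:01:55 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 4324',
    '205.163.36.61 - - [01/Aug/1995:00:01:57 -0400] "GET /shuttle/countdown/count70.gif HTTP/1.0" 304 0',
    'rpgopher.aist.go.jp - - [01/Aug/1995:00:01:58 -0400] "GET /ksc.html HTTP/1.0" 200 7280',
]
print(daily_summary(sample))
```

Every choice in there (what counts as a user, which requests count as a view, what to do with bad lines) changes the numbers that end up in the tidy table.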
Also, questions don't usually come as nicely phrased as "is there a linear relationship between number of users and page views?"
They are often more like "why has the number of users dropped?" or "is the ad campaign working?"
It's worth thinking of what happens to your data as a process:
"Figure out the question" should not be underestimated.
People don't usually ask questions that map to techniques you know. They don't always ask coherent questions that can be answered.
How you "find/acquire relevant data" will depend on the problem. Maybe you have to…
Most working data scientists admit that "clean & prepare the data" takes more than half of their time. Exactly how to clean the data requires understanding it.
e.g. For the web server logs: must parse the lines and aggregate the results by day/​user. Need to decide what data matters: HTML vs all requests; redirects or errors; …
Most aspiring data scientists imagine "analyze the data" as 100% of what they'll be doing. Could include techniques from…
Don't forget how important it is to "interpret & present results".
Clever results are useless if you don't tell anybody (or they don't understand).
This process is interdisciplinary, but a CS student in a stats class has a good start.
Inferential statistics are one way to get answers out of data… the one we care about today.
Some basic inferential stats can get a lot of results out of data, but be aware of your limits.
There are many statistical tests that might be useful.
For Python, SciPy's stats module implements 31 functions described as a "test" and StatsModels has many more.
How can you make sense of that to get the answers you're looking for?
Textbooks often describe tests like this, which is true but not very useful:
"The t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis." (Wikipedia: Student's t-test)
A few sentences later is a description that's more useful, but not complete:
"The t-test can be used, for example, to determine if two sets of data are significantly different from each other." (Wikipedia: Student's t-test)
The basic idea:
Every test has a null hypothesis (\(H_0\)). We start by assuming that \(H_0\) is true. Usually \(H_0\) is "nothing's happening".
The test gives us a \(p\) value that tells us something about how unusual the data is, assuming \(H_0\).
If the \(p\) value is very small, we decide that \(H_0\) is likely false, so the opposite (\(H_a\)) is likely true.
If \(p\) isn't small, we have no conclusion.
Basically, all inferential stats is like a proof by contradiction, but probabilistic.
What do you actually need to know about a statistical test to use it to get something done?
These are the things I want to know about a test:
"real" conclusion about the world?
So I understand the T-test as:
How I actually do a T-test, with Python:
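from scipy import stats

group1 = ...  # observations from the first group
group2 = ...  # observations from the second group
ttest = stats.ttest_ind(group1, group2)  # independent two-sample t-test
print(ttest.pvalue)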
Prints a \(p\) value, like:
0.017071681233782634
Remember: the \(p\) value is just a probability. Sometimes, you'll reach an incorrect conclusion (false positive) by chance.
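To see the whole thing run end to end, here is a minimal sketch with made-up data. The group means, sizes, seed, and the 0.05 threshold are all assumptions for illustration, not values from a real data set.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed, only so the run is repeatable

# Synthetic stand-ins for real measurements (assumed values).
group1 = rng.normal(loc=10.0, scale=2.0, size=40)
group2 = rng.normal(loc=11.0, scale=2.0, size=40)

ttest = stats.ttest_ind(group1, group2)

alpha = 0.05  # conventional threshold, chosen before looking at the data
if ttest.pvalue < alpha:
    print(f"p = {ttest.pvalue:.4f}: reject H0, the group means look different")
else:
    print(f"p = {ttest.pvalue:.4f}: no conclusion")
```

Note the two branches: a small \(p\) lets you reject \(H_0\), but a large \(p\) is "no conclusion", not evidence that the groups are the same.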
When your data gets interesting, it's easy to misuse statistics and make mistakes.
If you have 10 groups in your data set, and start to use a T-test to find out if they're different, you'll do \(C(10,2) = 45\) tests.
The probability of getting no false positives (with \(p<0.05\), and treating the tests as independent) is then
\[ 0.95^{45} = 0.09944\,. \]
So you're very likely to find something incorrect by accident.
Don't believe math? What about code?
When I run that, I find two "significantly different" results, even though all the data is from the same distribution.
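An experiment of this kind can be sketched as follows. This is one possible version under stated assumptions: 10 groups of 50 values each, all drawn from the same standard normal distribution, with every pair compared by a T-test. The function name and parameters are made up for illustration, and the number of "significant" pairs will change from run to run.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def count_false_positives(n_groups=10, n_samples=50, alpha=0.05, seed=None):
    """Run a t-test on every pair of groups drawn from ONE distribution.

    Returns (number of tests, number of tests with p < alpha).
    Any "significant" result here is a false positive by construction.
    """
    rng = np.random.default_rng(seed)
    groups = [rng.normal(size=n_samples) for _ in range(n_groups)]
    pvalues = [stats.ttest_ind(a, b).pvalue for a, b in combinations(groups, 2)]
    return len(pvalues), int(sum(p < alpha for p in pvalues))


n_tests, n_significant = count_false_positives()
print(n_tests, "tests,", n_significant, "with p < 0.05")
```

Run it a few times: getting zero "significant" pairs is the unusual outcome, which matches the calculation above.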
Be careful of doing this by accident: trying one test after another and ignoring the ones that don't give results.
It's called p-hacking. See also the Bonferroni correction to avoid it.
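The Bonferroni correction itself is simple: if you do \(m\) tests, compare each \(p\) value to \(\alpha/m\) instead of \(\alpha\). A minimal sketch, where the function name and the example \(p\) values are made up for illustration:

```python
def bonferroni_significant(pvalues, alpha=0.05):
    """Flag which p-values stay significant after a Bonferroni correction.

    Each p-value is compared to alpha / m, where m is the number of tests done.
    """
    m = len(pvalues)
    if m == 0:
        return []
    threshold = alpha / m
    return [p < threshold for p in pvalues]


# Hypothetical p-values from four tests: the threshold becomes 0.05 / 4 = 0.0125.
print(bonferroni_significant([0.0004, 0.017, 0.03, 0.2]))

# With the 45 pairwise tests from the 10-group example:
print(0.05 / 45)
```

The price is that real differences are harder to detect, which is one more reason to plan a small number of tests in advance.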
So when you have some imperfect questions and data, how do you start?
It's important to have a plan for your analysis: what test(s) are you going to do? If you have an honest plan before you start, then you won't be p-hacking.
Then you have to decide if your data meets the assumptions of your test.
e.g. Are the groups (close enough to being) normally distributed, so you can do a T-test? Are the variances the same (or similar)?
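One way to check those two assumptions, sketched with SciPy's normaltest and levene functions. Using these particular tests, the 0.05 threshold, and the synthetic data are my assumptions; other checks (like just plotting a histogram) are equally reasonable.

```python
import numpy as np
from scipy import stats


def check_ttest_assumptions(group1, group2, alpha=0.05):
    """Look for evidence AGAINST the t-test's assumptions.

    A True value means "no evidence of a problem", not proof that the
    assumption holds.
    """
    normal1 = stats.normaltest(group1).pvalue  # H0: group1 is normally distributed
    normal2 = stats.normaltest(group2).pvalue  # H0: group2 is normally distributed
    equal_var = stats.levene(group1, group2).pvalue  # H0: variances are equal
    return {
        "group1_normal_ok": bool(normal1 >= alpha),
        "group2_normal_ok": bool(normal2 >= alpha),
        "equal_variance_ok": bool(equal_var >= alpha),
    }


rng = np.random.default_rng(0)  # arbitrary seed; synthetic data for illustration
group1 = rng.normal(loc=10.0, scale=2.0, size=60)
group2 = rng.normal(loc=11.0, scale=2.0, size=60)
print(check_ttest_assumptions(group1, group2))
```

If the variances look different, stats.ttest_ind(group1, group2, equal_var=False) does Welch's version of the test, which doesn't assume equal variances. These checks are tests too, so they count toward your analysis plan.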
There is a lot more to data analysis than just statistics, but statistics is an important tool to have available.
When I took a stats course, I thought it was useless.
I was wrong.
Great book that's free online: Python Data Science Handbook, Jake VanderPlas.
See everything implemented in Python: Data Science from Scratch, Joel Grus
Come take my course? SFU's Professional Master's Program in Big Data.