Data Science Workshop

Workshop, February 2018

Greg Baker

https://bit.ly/cebu-2018-slides

This Workshop

What do you want to do? What would be useful?

What is Data Science?

According to Wikipedia: an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms…

According to Pat Hanrahan, Tableau Software: [The combination of] business knowledge, analytical skills, and computer science.

According to Daniel Tunkelang, LinkedIn: [The ability to] obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning.

What is Data Science?

According to Joel Grus: There's a joke that says a data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician.… We'll say that a data scientist is someone who extracts insights from messy data.

What is Data Science?

According to Drew Conway, Alluvium:

What is Data Science?

My working definition:

You get some data. Then what do you do to get answers from it? Whatever that is, that's data science.

Of course, one of those things is statistics.

Why Data Science?

Why is data science suddenly so popular?

There's more data being collected: web access logs, purchase history, click-through rates, location history, sensor data, …

Sometimes the volume of data is big: too big to manage easily, or with a single computer. That's where big data usually starts.

Why Data Science?

People want answers/insights from that data: Is the marketing campaign working? Is the UI actually usable? What if we did X instead of Y?

New techniques: Machine learning lets us attack questions that were previously unanswerable. Computer scientists are realizing that statistics is important; statisticians are realizing that computer science is important.

Data Pipeline

  1. Figure out the question.
  2. Find/acquire relevant data.
  3. Clean & prepare the data.
  4. Analyze the data.
  5. Interpret & present results.

Adapted from Spending Data Handbook; 3 Principles for Working with Big Data
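
The five steps above can be sketched as a tiny Python program. This is only an illustration under stated assumptions, not a method from the workshop: the question (did a new page design change the click-through rate?), the inline CSV data, and the function names acquire, clean, analyze and present are all hypothetical, chosen so that each pipeline step maps to one function.

```python
import csv
import io
from statistics import mean

# Hypothetical data, inlined so the sketch is self-contained.
# Two rows are deliberately broken (a missing value and a non-numeric value).
RAW_CSV = """group,clicks,visits
old,12,100
new,18,100
old,,100
new,15,abc
old,10,80
new,21,120
"""


def acquire(text):
    """Step 2: read the raw rows (here from a string; in practice a file or API)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Step 3: keep only rows whose counts parse as integers with visits > 0."""
    cleaned = []
    for row in rows:
        try:
            clicks, visits = int(row["clicks"]), int(row["visits"])
        except (TypeError, ValueError):
            continue  # skip missing or malformed values
        if visits > 0:
            cleaned.append({"group": row["group"], "rate": clicks / visits})
    return cleaned


def analyze(rows):
    """Step 4: average click-through rate per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["group"], []).append(row["rate"])
    return {name: mean(rates) for name, rates in groups.items()}


def present(result):
    """Step 5: turn the numbers into a line a person can read."""
    return ", ".join(f"{name}: {rate:.1%}" for name, rate in sorted(result.items()))


# Step 1: the (hypothetical) question is
# "did the new design change the click-through rate?"
rates = analyze(clean(acquire(RAW_CSV)))
print(present(rates))
```

Even in a toy like this, the cleaning step decides which rows count, so it is usually worth checking how many rows were dropped before trusting the averages.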