What To Do When Data Becomes Big Data

Workshop, Fall 2017

Greg Baker

http://bit.ly/bigdata-workshop-2017

This Workshop

Assumptions:

  • Some reasonable Python programming background.
  • Hopefully some Pandas DataFrame experience.

Basically: you can handle small data already.

This Workshop

My goal for today: give you a basic idea of how to use Spark when the data gets bigger.

Summary: If you can use Pandas, then Spark isn't much harder. And it scales out if you need it to.

Introductions

Who are you and what are you doing here? What big data problems do you have?

Setup

First task: open the instructions at http://bit.ly/bigdata-workshop-2017 and start downloading what they list.

Let's make sure you can get things running…

Not working? Log in as csguestd on the computers here.

Setup

If there's time, we can experiment on a cluster too: SSH with username csguestd to gateway.sfucloud.ca.