Assumptions:
Basically: you can handle small data already.
My goal for today: give you a basic idea of how to use Spark when the data gets bigger.
Summary: If you can use Pandas, then Spark isn't much harder. And it scales out if you need it to.
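To make the "not much harder" claim concrete, here is a minimal sketch of the same group-by computation in Pandas, with the near-identical PySpark call shown in comments. The column names and values are invented for illustration, and the PySpark part assumes a `SparkSession` named `spark` already exists:

```python
import pandas as pd

# A tiny Pandas computation: mean distance per city
# (data made up for illustration).
df = pd.DataFrame({
    'city': ['Vancouver', 'Burnaby', 'Vancouver'],
    'dist': [3.0, 5.0, 7.0],
})
means = df.groupby('city')['dist'].mean()
print(means['Vancouver'])  # prints 5.0

# The PySpark version has nearly the same shape (sketch, assuming
# a SparkSession named `spark`):
#   sdf = spark.createDataFrame(df)
#   sdf.groupBy('city').avg('dist').show()
```

The method names differ slightly (`groupBy` vs `groupby`), but the mental model carries over directly.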
What big data problems do you have?
First task: start downloading things from the instructions, http://bit.ly/bigdata-workshop-2017 .
Let's make sure you can get things running…
No? Log in as csguestd on the computers here.
If there's time, we can experiment on a cluster too: SSH with username csguestd to gateway.sfucloud.ca.
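For reference, the cluster login above would look like this from a terminal (a sketch; the username and host are taken from the instructions, and you will be prompted for the password):

```shell
# Connect to the workshop cluster gateway as csguestd.
ssh csguestd@gateway.sfucloud.ca
```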