Python doesn't seem like an obvious choice for data analysis.
Fun to write, but not noted for being fast, and kind of bad at arrays.
NumPy is the standard solution for array-like data in Python.
Provides a data type for fixed-type \(n\)-dimensional arrays, stored internally as a C-style array.
Also includes many useful tools for working with those arrays efficiently.
Importing:
import numpy as np
By convention, import as np (not numpy) to save a few characters (since it's used a lot).
a = np.array([10, 20, 30, 40], dtype=np.int32)
This allocates an array equivalent to the C:
int values[4] = {10, 20, 30, 40};
[Initializing from a Python list is usually a bad idea: it requires memory for both the list and the array.]
NumPy arrays have a fixed type for all elements. Chosen from the NumPy types (or custom types). Some examples:
np.byte
np.int32
np.int64
np.uint64
np.double
np.complex128
np.str_
np.object_
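A minimal sketch of what "fixed type" means in practice. The variable names are made up for illustration; the attributes used (dtype, itemsize, nbytes) are standard ndarray attributes.

```python
import numpy as np

# Every element shares the array's dtype, so storage per element is fixed.
small = np.array([10, 20, 30, 40], dtype=np.int32)
big = np.array([10, 20, 30, 40], dtype=np.int64)

print(small.dtype, small.itemsize, small.nbytes)  # 4 bytes per element
print(big.dtype, big.itemsize, big.nbytes)        # 8 bytes per element

# Assigning a float into an integer array does not change the array's type:
# the value is converted to fit, so the fractional part is dropped.
small[0] = 3.7
print(small[0])
```

The last line is the one to watch for: the array keeps its int32 type and quietly stores 3, not 3.7.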
Creating from Python lists is okay for toy examples, but you should usually construct more efficiently.
print( np.arange(6, 14) )
print( np.linspace(0, 5, 11, dtype=np.float32) )
print( np.empty((2, 3), dtype=np.int64) ) # uninitialized!
print( np.zeros((2, 3)) )
[ 6  7  8  9 10 11 12 13]
[ 0.   0.5  1.   1.5  2.   2.5  3.   3.5  4.   4.5  5. ]
[[140506921020328 140506921020328 140506806120376]
 [140506776227352 632708812243096               0]]
[[ 0.  0.  0.]
 [ 0.  0.  0.]]
NumPy is happy with large arrays (up to your memory limits):
print( np.zeros((1000, 10000)) )
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ...,
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
There is a collection of array creation functions for convenience:
print( np.identity(4) )
print( np.random.rand(3,4) )
[[ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]]
[[ 0.50385525  0.85417757  0.02541011  0.83578425]
 [ 0.44444898  0.6818858   0.8581728   0.09154115]
 [ 0.86676983  0.39580819  0.52601014  0.83193706]]
Result: memory efficiency like C. Programming with Python.
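A rough way to see the memory claim for yourself. This is a sketch, not a precise measurement: sys.getsizeof reports CPython's per-object overhead, so the list figure is an estimate and will vary between Python versions. The names n, py_list and arr are just for illustration.

```python
import sys
import numpy as np

n = 100_000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# A list stores pointers to separate Python int objects,
# so count the list itself plus every int it points to.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An array stores the raw 8-byte values back to back.
array_bytes = arr.nbytes

print("list: ", list_bytes, "bytes (approximate)")
print("array:", array_bytes, "bytes")
```

The array needs exactly 8 bytes per value, the same as a C array of 64-bit integers would.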
I can often write Python + NumPy (and maybe + numexpr) code that's faster than the C code I can write… but I'm not very good at C. 😄
Let's try it. In your notebook:
import numpy as np
import pandas as pd
rand = np.random.rand(5, 5)
rand
np.sum(rand)
rand.sum()
rand + 1
np.sqrt(rand)
Important lesson: operate on arrays with NumPy operations, not Python loops.
The speed difference can be a factor of hundreds.
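A small experiment to try, as a sketch: sum the same array with a Python loop and with a NumPy operation, and time both. The array size is arbitrary, and the actual timings depend on your machine, so no particular ratio is promised here.

```python
import time
import numpy as np

values = np.random.rand(200_000)

# Python loop: one interpreter step per element.
start = time.perf_counter()
loop_total = 0.0
for v in values:
    loop_total += v
loop_seconds = time.perf_counter() - start

# NumPy operation: the whole sum runs in compiled code.
start = time.perf_counter()
numpy_total = values.sum()
numpy_seconds = time.perf_counter() - start

print(f"loop:  {loop_seconds:.5f} s")
print(f"numpy: {numpy_seconds:.5f} s")
# Same answer either way, up to floating-point rounding.
print(np.isclose(loop_total, numpy_total))
```

In a notebook, the %timeit magic gives more reliable timings than a single perf_counter measurement.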
NumPy is very good at what it does: arrays.
Single-type arrays and operations on them aren't all there is to data storage and manipulation.
Pandas is a Python library that gives richer tools for manipulating data. Its central type is the DataFrame.
In some ways, a DataFrame is a lot like a SQL database table:
… but the way you work with the data is usually quite different.
The Pandas module is usually imported under a short name, like NumPy:
import numpy as np
import pandas as pd
DataFrames can be created many ways, including a Python dict:
df = pd.DataFrame(
    {
        'value': [1, 2, 3, 4],
        'word': ['one', 'two', 'three', 'four']
    }
)
print(df)
   value   word
0      1    one
1      2    two
2      3  three
3      4   four
In a notebook, DataFrames are even nicer looking: they are displayed as formatted tables.
You can read/write entire series with a single operation:
df['double'] = df['value'] * 2
print(df)
print(df[df['value'] % 2 == 0])
   value   word  double
0      1    one       2
1      2    two       4
2      3  three       6
3      4   four       8
   value  word  double
1      2   two       4
3      4  four       8
… actually, you should work on an entire DataFrame as a single operation. It's cleaner (less code) and faster (all of the operations are done in highly-optimized C).
In my classes, you must. All loops and recursion are banned.
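To make the "no loops" rule concrete, here is a sketch that computes the same column both ways, reusing the small DataFrame from the earlier slide. The loop version is shown only for contrast. The 'length' column is an extra illustration that is not in the original example; it uses the Series .str accessor.

```python
import pandas as pd

df = pd.DataFrame({
    'value': [1, 2, 3, 4],
    'word': ['one', 'two', 'three', 'four'],
})

# Loop version (what to avoid): build the column one row at a time.
doubled = []
for v in df['value']:
    doubled.append(v * 2)

# Single-operation version: one expression over the whole Series.
df['double'] = df['value'] * 2

# String columns have whole-Series operations too, via the .str accessor.
df['length'] = df['word'].str.len()

print(df)
```

Both versions give the same numbers; the second is shorter and leaves the iteration to Pandas.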
Another example:
cities = pd.DataFrame(
    [[2463431, 2878.52], [1392609, 5110.21], [5928040, 5905.71]],
    columns=['population', 'area'],
    index=pd.Index(['Vancouver', 'Calgary', 'Toronto'], name='city')
)
print(cities)
           population     area
city
Vancouver     2463431  2878.52
Calgary       1392609  5110.21
Toronto       5928040  5905.71
You can apply Python operators, or methods on a Series, or a NumPy ufunc, or ….
cities['density'] = cities['population'] / cities['area']
cities['pop-rank'] = cities['population'].rank(ascending=False)
print(cities)
           population     area      density  pop-rank
city
Vancouver     2463431  2878.52   855.797771       2.0
Calgary       1392609  5110.21   272.515024       3.0
Toronto       5928040  5905.71  1003.781086       1.0
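The example above covers operators and a Series method; a NumPy ufunc works the same way. A sketch using the same cities data, where the 'log-pop' column name and the choice of np.log10 and idxmax are only for illustration:

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame(
    [[2463431, 2878.52], [1392609, 5110.21], [5928040, 5905.71]],
    columns=['population', 'area'],
    index=pd.Index(['Vancouver', 'Calgary', 'Toronto'], name='city')
)

# A NumPy ufunc applied to a Series returns a Series with the same index.
log_pop = np.log10(cities['population'])
cities['log-pop'] = log_pop

# A Series method can also reduce a whole column to one value:
# idxmax gives the index label of the largest entry.
largest = cities['population'].idxmax()

print(cities)
print(largest)
```

Because the ufunc result keeps the city index, it can be assigned straight back as a new column.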
Enough slides… let's work with Pandas.
Go to https://bit.ly/cebu-2018-slides and follow the link Workshop code and exercises.