Python and Pandas

Python and Pandas

Python doesn't seem like an obvious choice for data analysis.

Fun to write, but not noted for being fast, and kind of bad at arrays.

NumPy

NumPy is the standard solution for array-like data in Python.

Provides a data type for fixed-type \(n\)-dimensional arrays, represented internally as a C-style array.

Also includes many useful tools for working with those arrays efficiently.

NumPy

Importing:

import numpy as np

By convention, import as np (not numpy) to save a few characters (since it's used a lot).

NumPy

a = np.array([10, 20, 30, 40], dtype=np.int64)

This allocates an array equivalent to the C:

int64_t values[4] = {10, 20, 30, 40};

[Initializing from a Python list is usually a bad idea: memory must be allocated for both the list and the array.]

NumPy

NumPy arrays have a fixed type for all elements. Chosen from the NumPy types (or custom types). Some examples:

  • np.byte
  • np.int32
  • np.int64
  • np.uint64
  • np.double
  • np.complex128
  • np.str_
  • np.object_
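
As a small sketch of what a fixed element type means in practice, the snippet below inspects an array's type and memory use. The array contents and the variable names a and b are only illustrative.

```python
import numpy as np

# Every element is stored as a 32-bit integer.
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

print(a.dtype)     # int32
print(a.shape)     # (2, 3)
print(a.itemsize)  # 4 (bytes per element)
print(a.nbytes)    # 24 (6 elements * 4 bytes)

# Converting to another type creates a new array; the original is unchanged.
b = a.astype(np.float64)
print(b.dtype)     # float64
print(b.itemsize)  # 8
```

Because the type is fixed, the memory footprint is just the element count times the element size, much like a C array.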

NumPy

Creating from Python lists is okay for toy examples, but you should usually construct more efficiently.

print( np.arange(6, 14) )
print( np.linspace(0, 5, 11, dtype=np.float32) )
print( np.empty((2, 3), dtype=np.int64) ) # uninitialized!
print( np.zeros((2, 3)) )
[ 6  7  8  9 10 11 12 13]
[ 0.   0.5  1.   1.5  2.   2.5  3.   3.5  4.   4.5  5. ]
[[140506921020328 140506921020328 140506806120376]
 [140506776227352 632708812243096               0]]
[[ 0.  0.  0.]
 [ 0.  0.  0.]]

NumPy

NumPy is happy with large arrays (up to your memory limits):

print( np.zeros((1000, 10000)) )
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

NumPy

There is a collection of array creation functions for convenience:

print( np.identity(4) )
print( np.random.rand(3,4) )
[[ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]]
[[ 0.50385525  0.85417757  0.02541011  0.83578425]
 [ 0.44444898  0.6818858   0.8581728   0.09154115]
 [ 0.86676983  0.39580819  0.52601014  0.83193706]]

NumPy

Result: memory efficiency like C. Programming with Python.

I can often write Python + NumPy (and maybe + numexpr) code that's faster than the C code I can write… but I'm not very good at C.

😄

NumPy Exercise

Let's try it. In your notebook:

import numpy as np
import pandas as pd
rand = np.random.rand(5, 5)
rand
np.sum(rand)
rand.sum()
rand + 1
np.sqrt(rand)

NumPy Exercise

Important lesson: operate on arrays with NumPy operations, not Python loops.

The speed difference can be a factor of hundreds.
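
One way to see this for yourself is to time the same calculation both ways. This is only a sketch: the function names and the array size are made up for illustration, and the actual ratio depends on your machine and the operation.

```python
import timeit
import numpy as np

values = np.random.rand(100_000)

def loop_sum_of_squares(arr):
    # Python loop: one interpreted step per element.
    total = 0.0
    for x in arr:
        total += x * x
    return total

def numpy_sum_of_squares(arr):
    # NumPy operations: the loops happen in compiled code.
    return np.sum(arr * arr)

loop_time = timeit.timeit(lambda: loop_sum_of_squares(values), number=5)
numpy_time = timeit.timeit(lambda: numpy_sum_of_squares(values), number=5)
print(f"loop: {loop_time:.4f}s  numpy: {numpy_time:.4f}s")
```

Both functions compute the same value (up to floating-point rounding); only the time taken differs.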

Pandas

NumPy is very good at what it does: arrays.

Single-type arrays and operations on them aren't all there is to data storage and manipulation.

Pandas

Pandas is a Python library that gives richer tools for manipulating data.

  • A Series is a 1D array, internally stored as a NumPy array.
  • A DataFrame is a collection of columns (Series); the entries with the same label across the Series form a row.
  • Entries in Series and DataFrames are labeled.
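
A minimal sketch of these three points, using made-up data and labels (the names measurement and count are only for illustration):

```python
import pandas as pd

labels = ['a', 'b', 'c']

# A Series: 1D data with a label for each entry.
s = pd.Series([1.5, 2.5, 3.5], index=labels, name='measurement')
print(s['b'])         # look up an entry by its label
print(s.to_numpy())   # the values as a NumPy array

# A DataFrame: several Series sharing the same labels.
df = pd.DataFrame({
    'measurement': s,
    'count': pd.Series([10, 20, 30], index=labels),
})
print(df['count'])    # one column, as a Series
print(df.loc['b'])    # one row: the entries labeled 'b' from each column
```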

Pandas

In some ways, a DataFrame is a lot like a SQL database table:

  • DF Series ≈ SQL columns. Each column has a fixed type.
  • DF row ≈ SQL row. One entry from each column.
  • DF label ≈ SQL primary key.

… but the way you work with the data is usually quite different.
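
To give a feel for the difference, here is a sketch of two lookups with a rough SQL equivalent in the comments. The table contents are illustrative, and the SQL is only an approximate analogy.

```python
import pandas as pd

df = pd.DataFrame({
    'value': [1, 2, 3, 4],
    'word': ['one', 'two', 'three', 'four'],
})

# Roughly: SELECT word FROM df WHERE value > 2
selected = df.loc[df['value'] > 2, 'word']
print(selected.tolist())

# Roughly: SELECT * FROM df WHERE <primary key> = 1
row = df.loc[1]
print(row['word'])
```

Instead of writing a query string, you build a boolean Series and use it to select rows.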

Working With Pandas

The Pandas module is usually imported under a short alias, like NumPy:

import numpy as np
import pandas as pd

Working With Pandas

DataFrames can be created many ways, including a Python dict:

df = pd.DataFrame(
    {
        'value': [1, 2, 3, 4],
        'word': ['one', 'two', 'three', 'four']
    }
)
print(df)
   value   word
0      1    one
1      2    two
2      3  three
3      4   four

Working With Pandas

In a notebook, DataFrames are even nicer looking:

[Image: a DataFrame in a Jupyter notebook]

Working With Pandas

You can read/write entire series with a single operation:

df['double'] = df['value'] * 2
print(df)
print(df[df['value'] % 2 == 0])
   value   word  double
0      1    one       2
1      2    two       4
2      3  three       6
3      4   four       8
   value  word  double
1      2   two       4
3      4  four       8

Working With Pandas

… actually, you should work on an entire DataFrame as a single operation. It's cleaner (less code) and faster (all of the operations are done in highly optimized C).

In my classes, you must. All loops and recursion are banned.
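
As a sketch of what that looks like, the two new columns below are each computed with a single operation where you might be tempted to loop over rows. The column names parity and length are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'value': [1, 2, 3, 4],
    'word': ['one', 'two', 'three', 'four'],
})

# A conditional on every row at once, instead of an if/else inside a loop.
df['parity'] = np.where(df['value'] % 2 == 0, 'even', 'odd')

# A string operation on every entry at once, via the .str accessor.
df['length'] = df['word'].str.len()

print(df)
```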

Working With Pandas

Another example:

cities = pd.DataFrame(
    [[2463431, 2878.52], [1392609, 5110.21], [5928040, 5905.71]],
    columns=['population', 'area'],
    index=pd.Index(['Vancouver','Calgary','Toronto'], name='city')
)
print(cities)
           population     area
city                          
Vancouver     2463431  2878.52
Calgary       1392609  5110.21
Toronto       5928040  5905.71

Working With Pandas

You can apply Python operators, or methods on a Series, or a NumPy ufunc, or ….

cities['density'] = cities['population'] / cities['area']
cities['pop-rank'] = cities['population'].rank(ascending=False)
print(cities)
           population     area      density  pop-rank
city                                                 
Vancouver     2463431  2878.52   855.797771       2.0
Calgary       1392609  5110.21   272.515024       3.0
Toronto       5928040  5905.71  1003.781086       1.0
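
The example above uses an operator and a Series method. A sketch of the NumPy ufunc case, plus two more Series and DataFrame methods, might look like this (the column name log-pop is made up for illustration):

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame(
    [[2463431, 2878.52], [1392609, 5110.21], [5928040, 5905.71]],
    columns=['population', 'area'],
    index=pd.Index(['Vancouver', 'Calgary', 'Toronto'], name='city')
)

# A NumPy ufunc applied to a Series returns a Series with the same labels.
cities['log-pop'] = np.log10(cities['population'])

# Series method: the label of the largest entry.
largest = cities['area'].idxmax()
print(largest)

# DataFrame method: reorder the rows by one column.
by_pop = cities.sort_values('population', ascending=False)
print(by_pop)
```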

Pandas Exercises

Enough slides… let's work with Pandas.

Go to https://bit.ly/cebu-2018-slides and follow the link Workshop code and exercises.