Pandas Basics

Many of you have worked with DataFrames before (in R, maybe). In Python, the Pandas library provides DataFrames, which…

  • store data arranged in rows and columns.
  • have a schema: each column has a name and a type.
  • expose each column as a Series.

If you're familiar with SQL databases:

  • DF schema ≈ SQL table schema.
  • DF Series ≈ SQL columns.
  • DF row ≈ SQL row: one entry from each column.
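
A minimal sketch of the analogy, using a small made-up DataFrame (the column names and values are illustrative only): selecting one column gives a Series, and selecting one row gives one entry from each column.

```python
import pandas as pd

# Illustrative data only; not from any real data set.
df = pd.DataFrame({'a': ['A', 'B', 'C'], 'b': [4, 5, 6]})

col = df['b']      # one column: a pandas Series
row = df.iloc[0]   # one row: also a Series, indexed by column name

print(type(col).__name__)   # the class of a single column
print(list(df.columns))     # the column names in the schema
print(row['a'], row['b'])   # one entry from each column
```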

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'C'], 'b': [4, 5, 6]})
print(df)
print()
print(df.dtypes)

   a  b
0  A  4
1  B  5
2  C  6

a    object
b     int64
dtype: object

Data usually comes from an external source (a file or a database), not from values written out in the code.

import numpy as np
import pandas as pd

df1 = pd.read_csv('somedata.csv', header=0)
df2 = pd.read_hdf('somedata.h5', key='data2')

connection = ...  # placeholder: an open database connection, left unspecified here
df3 = pd.read_sql_query(
    'SELECT col1, col2 FROM tbl WHERE v>0',
    connection
)
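
One way to try the SQL case without a database server is an in-memory SQLite database. This is only a sketch under that assumption: the table contents are made up, and a real connection would depend on the database in use.

```python
import sqlite3
import pandas as pd

# Assumption: an in-memory SQLite database stands in for a real one.
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE tbl (col1 TEXT, col2 INTEGER, v INTEGER)')
connection.executemany(
    'INSERT INTO tbl VALUES (?, ?, ?)',
    [('x', 1, 5), ('y', 2, -3), ('z', 3, 7)],  # made-up rows
)
connection.commit()

# Same query as above: only rows with v>0 come back.
df3 = pd.read_sql_query(
    'SELECT col1, col2 FROM tbl WHERE v>0',
    connection
)
connection.close()
print(df3)
```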

Exercise: Reddit Averages

A favourite data set of mine: the Reddit Comment Corpus.

import numpy as np
import pandas as pd
import gzip

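# Open the gzip-compressed file in text mode; the with block closes it when done.
with gzip.open('part-00000.json.gz', 'rt', encoding='utf-8') as uncompressed:
    # lines=True: the file holds one JSON object per line.
    data = pd.read_json(uncompressed, lines=True)
print(data.loc[0])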

The corpus contains information about every Reddit comment. The first record looks like this:

archived                                                               True
author                                                          faceplanted
author_flair_css_class                                                  NaN
author_flair_text                                                       NaN
body                      You know, I've had the feeling recently that p...
controversiality                                                          0
created_utc                                                      1322414722
downs                                                                     0
edited                                                                false
gilded                                                                    0
id                                                                  c3352f2
link_id                                                            t3_monh5
month                                                                    11
name                                                             t1_c3352f2
parent_id                                                        t1_c32qtj3
retrieved_on                                                     1427936247
score                                                                     1
score_hidden                                                          False
subreddit                                                              xkcd
subreddit_id                                                       t5_2qh0z
ups                                                                       1
Name: 0, dtype: object

The task: calculate the average score for comments in each subreddit.
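
As a hedged starting point, the general shape of a per-group average can be sketched on a few made-up rows. Only the subreddit and score column names come from the record shown earlier; the values are invented for illustration.

```python
import pandas as pd

# Made-up comments; only the 'subreddit' and 'score' columns matter here.
data = pd.DataFrame({
    'subreddit': ['xkcd', 'xkcd', 'canada', 'canada', 'canada'],
    'score': [1, 3, 2, 4, 6],
})

# Group the rows by subreddit, then average the scores within each group.
averages = data.groupby('subreddit')['score'].mean()
print(averages)
```

The result is a Series indexed by subreddit name, so averages['xkcd'] looks up one group's mean.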

Exercise: Weather Extraction

Next data set to play with: a subset of the data from the Global Historical Climatology Network.

It contains daily weather observations from thousands of weather stations for many years.

  1. Keep only the records we care about:
    1. field qflag (quality flag) is null;
    2. the station starts with 'CA';
    3. the observation is 'TMAX'.
  2. Divide the temperature by 10 to get °C (the raw values are stored in tenths of a degree).
  3. Keep only the columns station, date, and tmax (the value after dividing by 10).
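
The steps above can be sketched on a few made-up records. The column names observation and value are assumptions for illustration (the real file may name them differently); station, date, qflag, and tmax follow the description above.

```python
import pandas as pd

# Made-up records; 'observation' and 'value' are assumed column names.
weather = pd.DataFrame({
    'station': ['CA001', 'CA001', 'US002', 'CA003'],
    'date': ['2016-01-01', '2016-01-01', '2016-01-01', '2016-01-02'],
    'observation': ['TMAX', 'TMIN', 'TMAX', 'TMAX'],
    'value': [125, 20, 300, -45],
    'qflag': [None, None, None, 'G'],
})

# Step 1: combine the three conditions into one boolean mask.
keep = (
    weather['qflag'].isnull()
    & weather['station'].str.startswith('CA')
    & (weather['observation'] == 'TMAX')
)
result = weather[keep].copy()

# Step 2: convert tenths of a degree to degrees Celsius.
result['tmax'] = result['value'] / 10

# Step 3: keep only the columns of interest.
result = result[['station', 'date', 'tmax']]
print(result)
```

Boolean masks are combined with & rather than and, and each comparison needs its own parentheses because & binds more tightly than ==.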