DataFrames are something many have worked with (in R, maybe). In Python, the Pandas library provides DataFrames which…
If you're familiar with SQL databases:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'C'], 'b': [4, 5, 6]})
print(df)
print()
print(df.dtypes)
   a  b
0  A  4
1  B  5
2  C  6

a    object
b     int64
dtype: object
Data is usually going to come from a data source, not from the code.
import numpy as np
import pandas as pd

df1 = pd.read_csv('somedata.csv', header=0)
df2 = pd.read_hdf('somedata.h5', key='data2')
connection = ???
df3 = pd.read_sql_query(
    'SELECT col1, col2 FROM tbl WHERE v>0',
    connection
)
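One way the database case could look, as a minimal sketch: here an in-memory SQLite database (an assumption, not necessarily the data source intended above) stands in for the connection, and the table contents are made up so the query has something to return.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for the real data source.
connection = sqlite3.connect(':memory:')

# Hypothetical table and rows, created only so the query below has data.
connection.execute('CREATE TABLE tbl (col1 TEXT, col2 INTEGER, v REAL)')
connection.executemany(
    'INSERT INTO tbl VALUES (?, ?, ?)',
    [('x', 1, 0.5), ('y', 2, -1.0), ('z', 3, 2.0)],
)
connection.commit()

# The query result comes back as a DataFrame with one column per selected field.
df3 = pd.read_sql_query('SELECT col1, col2 FROM tbl WHERE v>0', connection)
connection.close()
print(df3)
```

Other databases would need their own connection object; only the way `connection` is created should change.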
A favourite data set of mine: the Reddit Comment Corpus.
import numpy as np
import pandas as pd
import gzip

uncompressed = gzip.open('part-00000.json.gz', 'rt', encoding='utf-8')
data = pd.read_json(uncompressed, lines=True)
print(data.loc[0])
… contains information about every Reddit comment.
archived                True
author                  faceplanted
author_flair_css_class  NaN
author_flair_text       NaN
body                    You know, I've had the feeling recently that p...
controversiality        0
created_utc             1322414722
downs                   0
edited                  false
gilded                  0
id                      c3352f2
link_id                 t3_monh5
month                   11
name                    t1_c3352f2
parent_id               t1_c32qtj3
retrieved_on            1427936247
score                   1
score_hidden            False
subreddit               xkcd
subreddit_id            t5_2qh0z
ups                     1
Name: 0, dtype: object
The task: calculate the average score for comments in each subreddit.
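A minimal sketch of one way to do this in Pandas: group the rows by subreddit, then take the mean of the score within each group. The small DataFrame here is made-up stand-in data (only the `subreddit` and `score` columns from the record above), not real values from the corpus.

```python
import pandas as pd

# Hypothetical stand-in for the Reddit data: just the two columns the task needs.
data = pd.DataFrame({
    'subreddit': ['xkcd', 'xkcd', 'canada', 'canada', 'canada'],
    'score': [1, 3, 2, 4, 6],
})

# Group the comments by subreddit, then average the score within each group.
averages = data.groupby('subreddit')['score'].mean()
print(averages)
```

The result is a Series indexed by subreddit name. With the real data, the same `groupby` line should apply once `data` has been loaded with `read_json`.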
Next data set to play with: a subset of the data from the Global Historical Climatology Network.
It contains daily weather observations from thousands of weather stations for many years.
qflag (quality flag) is null;
'CA';
'TMAX'.