Many of you have already worked with DataFrames (in R, perhaps). In Python, the Pandas library provides DataFrames which…
If you're familiar with SQL databases:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['A', 'B', 'C'], 'b': [4, 5, 6]})
print(df)
print()
print(df.dtypes)
   a  b
0  A  4
1  B  5
2  C  6

a    object
b     int64
dtype: object
Data will usually come from an external data source, not be written into the code.
import numpy as np
import pandas as pd
df1 = pd.read_csv('somedata.csv', header=0)
df2 = pd.read_hdf('somedata.h5', key='data2')
connection = ...  # placeholder: a DB-API connection or SQLAlchemy engine
df3 = pd.read_sql_query(
    'SELECT col1, col2 FROM tbl WHERE v>0',
    connection
)
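The connection above is left unspecified. As a minimal sketch, assuming an in-memory SQLite database from the standard library's sqlite3 module, something like the following would run end to end. The table contents are made up for illustration; only the query text comes from the example above.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for a real database: an in-memory SQLite table
# with made-up rows, shaped to match the query in the example.
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE tbl (col1 TEXT, col2 INTEGER, v REAL)')
connection.executemany(
    'INSERT INTO tbl VALUES (?, ?, ?)',
    [('x', 1, 0.5), ('y', 2, -1.0), ('z', 3, 2.0)],
)
connection.commit()

# Only the rows with v > 0 come back, as a DataFrame with two columns.
df3 = pd.read_sql_query(
    'SELECT col1, col2 FROM tbl WHERE v>0',
    connection,
)
connection.close()
print(df3)
```

For other databases, the connection would typically be a SQLAlchemy engine or connection rather than a sqlite3 one.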
A favourite data set of mine: the Reddit Comment Corpus.
import numpy as np
import pandas as pd
import gzip
uncompressed = gzip.open('part-00000.json.gz', 'rt', encoding='utf-8')
data = pd.read_json(uncompressed, lines=True)
print(data.loc[0])
… contains information about every Reddit comment.
archived                  True
author                    faceplanted
author_flair_css_class    NaN
author_flair_text         NaN
body                      You know, I've had the feeling recently that p...
controversiality          0
created_utc               1322414722
downs                     0
edited                    false
gilded                    0
id                        c3352f2
link_id                   t3_monh5
month                     11
name                      t1_c3352f2
parent_id                 t1_c32qtj3
retrieved_on              1427936247
score                     1
score_hidden              False
subreddit                 xkcd
subreddit_id              t5_2qh0z
ups                       1
Name: 0, dtype: object
The task: calculate the average score for comments in each subreddit.
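One way to approach this in Pandas is a groupby followed by a mean. The sketch below uses a small made-up DataFrame with only the `subreddit` and `score` columns (the values are invented, not taken from the corpus); with the real data, the same two lines would apply to the DataFrame loaded by `pd.read_json`.

```python
import pandas as pd

# Made-up rows with the two columns the task needs.
comments = pd.DataFrame({
    'subreddit': ['xkcd', 'xkcd', 'canada', 'canada', 'canada'],
    'score': [1, 5, 2, 4, 9],
})

# Group rows by subreddit, then take the mean score within each group.
# The result is a Series indexed by subreddit name.
averages = comments.groupby('subreddit')['score'].mean()
print(averages)
```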
Next data set to play with: a subset of the data from the Global Historical Climatology Network.
It contains daily weather observations from thousands of weather stations for many years.
The subset keeps only records where qflag (quality flag) is null, for 'CA' (Canada) and 'TMAX' (daily maximum temperature).
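A selection like this can be written as a boolean mask in Pandas. The sketch below uses made-up rows, and the column names `station`, `observation`, and `value` are assumptions (only `qflag` is named here). Reading 'CA' as a prefix of the station ID is also an assumption.

```python
import numpy as np
import pandas as pd

# Made-up observations; column names other than qflag are hypothetical.
weather = pd.DataFrame({
    'station': ['CA001', 'CA002', 'US001', 'CA003'],
    'observation': ['TMAX', 'TMIN', 'TMAX', 'TMAX'],
    'value': [215, 40, 300, 180],
    'qflag': [np.nan, np.nan, np.nan, 'G'],
})

# Combine the three conditions with & to build one boolean mask.
keep = (
    weather['qflag'].isnull()
    & weather['station'].str.startswith('CA')
    & (weather['observation'] == 'TMAX')
)
subset = weather[keep]
print(subset)
```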