Machine learning techniques are among the most common tools for data analysis.
ML techniques are increasingly important: many data-analysis problems are now attacked with them.
The basic premise of a [supervised] machine learning problem:
To get anywhere, we need to have a lot of correct input/​output pairs.
We will use most of them to train the model. Hopefully it will find whatever relevant structure/​patterns are in the data, and later make good predictions on new inputs.
But how will we know if good predictions are being made?
Usually, we want to break up the known input/​outputs into two sets: training data to train the model and testing data to test how good the predictions are.
from sklearn.model_selection import train_test_split
X = known_inputs
y = corresponding_outputs
X_train, X_test, y_train, y_test = train_test_split(X, y)
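As a concrete sketch of the split (the tiny synthetic dataset here is just for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# A tiny made-up dataset: 8 inputs with 2 features each, and 8 labels.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# By default, 25% of the data is held out for testing.
# random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)  # (6, 2) (2, 2)
```

The split is shuffled, so the test set is a random quarter of the data rather than just the last few rows.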
The scikit-learn module implements many machine learning algorithms and corresponding tools.
It's still important to understand how the models work: you won't know their strengths and weaknesses otherwise. But implementing them can be done by somebody else.
The models implemented in scikit-learn all have the same general API. First create the model with whatever parameters it needs:
from sklearn.svm import SVC
model = SVC(kernel='linear', C=0.05)
Then train it with the training data:
model.fit(X_train, y_train)
Once you have a model, you can check how it does on the testing data:
print(model.score(X_test, y_test))
0.97
You can fiddle with the model and its parameters to get the best possible results.
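That fiddling can also be automated: scikit-learn's GridSearchCV tries every combination of candidate parameter values with cross-validation. A minimal sketch (the parameter values here are arbitrary choices, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate values to try for each parameter.
params = {'kernel': ['linear', 'rbf'], 'C': [0.05, 1.0, 10.0]}

# Fit one model per combination, scored with 5-fold cross-validation.
search = GridSearchCV(SVC(), params, cv=5)
search.fit(X, y)
print(search.best_params_)
```

After fitting, `search.best_params_` holds the winning combination and `search.best_estimator_` is a model trained with it.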
And finally, make some predictions on new inputs:
model.predict(X_new)
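Putting all of the steps together, here is a hedged end-to-end sketch using scikit-learn's built-in iris dataset (standing in for real known inputs/outputs):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Labelled examples: inputs X and correct outputs y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create the model, train it, then check it on the held-out data.
model = SVC(kernel='linear', C=0.05)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

# Predict classes for inputs the model has never seen.
X_new = X_test[:3]
print(model.predict(X_new))
```

The same create/fit/score/predict sequence works for essentially every scikit-learn model; only the first line changes.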
Let's try something…