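This was my first time doing a hands-on exercise on Kaggle. My first impression: the interface is attractive, the guided tutorials are friendly and well thought out, and it feels like a fairly mature platform.
The datasets are plentiful too: historical European football scores, US presidential election analysis, a programming language usage survey, human resources analytics, historical plane crash statistics, IMDB movie ratings, and some anonymized lending and credit risk data.
I found the Titanic dataset and followed the guide through my first task. The DataCamp course gives a task description for each exercise; you write code based on the hints, submit it, and use the feedback to fix mistakes until you get it right. It feels a lot like the tutorial quests in a new video game.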
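Requirements analysis
The Titanic passenger information comes in two parts: the Train data and the Test data. The goal is to analyze the Train features together with the label "survived or not", clean the data, select features, and build a decision model that predicts whether each passenger in the Test data survived.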
Field meanings
VARIABLE DESCRIPTIONS:
survival    Survival (0 = No; 1 = Yes)
pclass      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name        Name
sex         Sex
age         Age
sibsp       Number of Siblings/Spouses Aboard
parch       Number of Parents/Children Aboard
ticket      Ticket Number
fare        Passenger Fare
cabin       Cabin
embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Python code
import numpy as np
from sklearn import tree
import pandas as pd
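Installing numpy and scipy the official way kept failing for me. I eventually found a download page that offers many unofficial builds of Python libraries.
In cmd: python -m pip install xx.whl, and the installation succeeded.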
Step 1: Import and inspect the data
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)
# Print summary statistics (`describe`) of the train and test dataframes
print(train.describe())
print(test.describe())
PassengerId Pclass Age SibSp Parch Fare
count 418.000000 418.000000 332.000000 418.000000 418.000000 417.000000
mean 1100.500000 2.265550 30.272590 0.447368 0.392344 35.627188
std 120.810458 0.841838 14.181209 0.896760 0.981429 55.907576
min 892.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 996.250000 1.000000 21.000000 0.000000 0.000000 7.895800
50% 1100.500000 3.000000 27.000000 0.000000 0.000000 14.454200
75% 1204.750000 3.000000 39.000000 1.000000 0.000000 31.500000
max 1309.000000 3.000000 76.000000 8.000000 9.000000 512.329200
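Fare and Age contain some missing values (their counts are 417 and 332 instead of 418), so they need to be filled in before training a model.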
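Reading the gaps off the `count` row works, but pandas can also report missing values directly. Here is a minimal sketch on a made-up four-row frame (the `toy` data is invented for illustration and is not the real test set):

```python
import numpy as np
import pandas as pd

# Made-up miniature frame standing in for the test set
toy = pd.DataFrame({
    "Pclass": [3, 1, 2, 3],
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

# Number of missing values in each column
missing = toy.isnull().sum()
print(missing)
```

On the real data, `test.isnull().sum()` should give the same kind of per-column report.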
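European gentlemen of the time upheld the tradition of "ladies first", so let's look at how sex affects the label we want to predict.
Step 2: Analyze the relationship between the Sex feature and the target label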
# Passengers that survived vs passengers that passed away
print(train["Survived"].value_counts())
# As proportions
print(train["Survived"].value_counts(normalize = True))
# Males that survived vs males that passed away
print(train["Survived"][train["Sex"] == 'male'].value_counts())
# Females that survived vs Females that passed away
print(train["Survived"][train["Sex"] == 'female'].value_counts())
# Normalized male survival
print(train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True))
# Normalized female survival
print(train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True))
<script.py> output:
0 549
1 342
Name: Survived, dtype: int64
0 0.616162
1 0.383838
Name: Survived, dtype: float64
0 468
1 109
Name: Survived, dtype: int64
1 233
0 81
Name: Survived, dtype: int64
0 0.811092
1 0.188908
Name: Survived, dtype: float64
1 0.742038
0 0.257962
Name: Survived, dtype: float64
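The breakdown by sex shows that about 19% of men and 74% of women survived. So a rule as simple as "every woman survives, every man dies" would already be right for 74% of the women and 81% of the men, which is about 79% of the training passengers overall. That is the baseline any model should beat.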
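To make the baseline concrete, here is a small sketch that recomputes it from the counts printed above. Only the four counts come from the output; the variable names are my own.

```python
# Counts copied from the value_counts() output above
male_died, male_survived = 468, 109
female_died, female_survived = 81, 233
total = male_died + male_survived + female_died + female_survived

# Rule: predict "survived" for every woman and "died" for every man
correct = female_survived + male_died
gender_baseline = correct / total

# Even simpler rule: predict "died" for everyone
all_died_baseline = (male_died + female_died) / total

print(round(gender_baseline, 4))    # 0.7868
print(round(all_died_baseline, 4))  # 0.6162
```

These are accuracies on the training set; the score on the test set may differ.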
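We also know that young children were given priority for the lifeboats.
Step 3: Analyze the relationship between the Age feature and the Survived label
To make the statistics easier, and to prepare for decision tree training later, the continuous Age variable is converted into a discrete categorical variable.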
# Create the column Child and initialise it to NaN
train["Child"] = float('NaN')
# Assign 1 to passengers under 18, 0 to those 18 or older. Print the new column.
# (.loc avoids the chained assignment of the original, which may silently fail to update train)
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0
print(train)
# Print normalized Survival Rates for passengers under 18
print(train["Survived"][train["Child"] == 1].value_counts(normalize = True))
# Print normalized Survival Rates for passengers 18 or older
print(train["Survived"][train["Child"] == 0].value_counts(normalize = True))
1 0.539823
0 0.460177
Name: Survived, dtype: float64
0 0.618968
1 0.381032
Name: Survived, dtype: float64
About 54% of the minors survived, compared with 38% of the adults.
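The same binning and comparison can be written more compactly with `np.where` and `groupby`. This is only a sketch on a made-up six-row frame (the `toy` data is invented), not the course's solution:

```python
import numpy as np
import pandas as pd

# Made-up passengers; one has an unknown age
toy = pd.DataFrame({
    "Age": [4.0, 17.0, 18.0, 40.0, np.nan, 30.0],
    "Survived": [1, 0, 1, 0, 1, 0],
})

# 1 for under 18, 0 for 18 or older, NaN where Age is unknown
toy["Child"] = np.where(toy["Age"].isnull(), np.nan, (toy["Age"] < 18).astype(float))

# The mean of a 0/1 column is the survival rate of each group
rates = toy.groupby("Child")["Survived"].mean()
print(rates)
```

Passengers with an unknown age stay NaN in `Child`, and `groupby` leaves them out of both groups by default.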
Step 4: Data cleaning and format conversion
# Convert the male and female groups to integer form
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
# Impute missing Age values with the mean
# (this step is listed in the notes below but was absent from the original listing)
train["Age"] = train["Age"].fillna(train["Age"].mean())
# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")
# Convert the Embarked classes to integer form
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})
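For the decision tree model to work properly and efficiently, the data has to be cleaned first:
- Convert Sex into a 0/1 variable
- Fill in the missing Age values with the mean
- Convert the Embarked variable into discrete numeric codes
Generally speaking, data cleaning and feature selection take up 70% to 80% of the total time of a data analysis project, and they largely determine how accurate the predictions can be.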
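The three cleaning steps can be bundled into one function so that exactly the same treatment is applied to both train and test. This is a rough sketch; `clean` is a hypothetical helper of my own and the `toy` frame is made up.

```python
import numpy as np
import pandas as pd

def clean(df):
    """Return a cleaned copy of a Titanic-style frame (hypothetical helper)."""
    out = df.copy()
    # Sex as 0/1
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    # Missing ages filled with the mean age
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    # Missing ports filled with "S", then encoded as integers
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["C", np.nan, "Q"],
})
cleaned = clean(toy)
print(cleaned)
```

Working on a copy keeps the raw frame intact, which makes it easy to rerun the cleaning after changing a rule.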
Step 5: Build and train the decision tree model
# Import the Numpy library
import numpy as np
# Import 'tree' from scikit-learn library
from sklearn import tree
# Print the train data to see the available features
print(train)
# Create the target and features numpy arrays: target, features_one
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values
# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)
# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))
This uses the scientific computing library NumPy and the machine learning library sklearn to fit a model to the features.
[ 0.12545743 0.31274009 0.23086653 0.33093596]
0.977553310887
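Surprisingly, the Fare field carries the most weight in predicting the label, at about 33%. The accuracy is about 97.8%, but note that it is measured on the same training data the tree was fitted to, so it says little about how well the tree will do on unseen passengers.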
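One way to get a more honest number is to hold back part of the data and score the tree on that instead. The sketch below uses synthetic data (random features with deliberately noisy labels, all invented for illustration), not the Titanic set:

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(400, 4)
# The label depends on the first feature, with about 20% of labels flipped as noise
y = (X[:, 0] > 0.5).astype(int)
flip = rng.rand(400) < 0.2
y[flip] = 1 - y[flip]

# Keep 25% of the rows aside for validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
valid_acc = model.score(X_valid, y_valid)
print(train_acc, valid_acc)
```

An unrestricted tree memorizes the noisy training labels, so the training accuracy comes out far higher than the validation accuracy.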
Step 6: Use the trained model to predict the test data
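# Impute the missing Fare value (row 152) with the median
test.loc[152, "Fare"] = test["Fare"].median()
# Assumption: the raw test set needs the same cleaning as train before predicting
# (these three lines are not in the original listing)
test["Sex"] = test["Sex"].map({"male": 0, "female": 1})
test["Age"] = test["Age"].fillna(train["Age"].mean())
test["Embarked"] = test["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})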
# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values
# Make your prediction using the test set and print them.
my_prediction = my_tree_one.predict(test_features)
print(my_prediction)
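The missing Fare value is filled with the median, and the trained model is then used to predict the test dataset.
Step 7: Export the predictions to csv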
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId =np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)
# Check that your data frame has 418 entries
print(my_solution.shape)
# Write your solution to a csv file with the name my_solution_one.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
Supplement 1: Tuning the decision tree parameters
# Create a new array with the added features: features_two
features_two = train[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values
# Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5: my_tree_two
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth = max_depth, min_samples_split = min_samples_split, random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)
# Print the feature importances and the score of the new decision tree
print(my_tree_two.feature_importances_)
print(my_tree_two.score(features_two, target))
Maybe we can improve the overfit model by making a less complex model? In DecisionTreeClassifier, the complexity of the tree is controlled by two parameters:
- the max_depth parameter determines when the splitting up of the decision tree stops.
- the min_samples_split parameter sets the minimum number of observations a bucket must hold before it can be split. If that threshold is not reached (e.g. minimum 10 passengers), no further splitting can be done.
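To guard against overfitting, the decision tree needs to be "pruned". The following parameters can be tuned:
- **max_features**: the number of features considered when looking for the best split cannot exceed this value.
- **max_depth**: (default=None) the maximum depth of the tree. With None, the tree keeps growing until every leaf contains a single class or holds fewer than min_samples_split samples.
- **min_samples_split**: the minimum number of samples a node must contain before it can be split.
- **min_samples_leaf**: the minimum number of samples in a leaf node.
- **max_leaf_nodes**: (default=None) the maximum number of leaf nodes in the tree.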
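To see what these limits do in practice, here is a small sketch comparing an unrestricted tree with a constrained one. The data is synthetic and the parameter values are arbitrary picks for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn import tree

rng = np.random.RandomState(1)
X = rng.rand(300, 5)
# Noisy label driven by the first feature
y = (X[:, 0] + 0.3 * rng.randn(300) > 0.5).astype(int)

# No limits: the tree grows until every leaf is pure
full = tree.DecisionTreeClassifier(random_state=1).fit(X, y)

# Limits on depth, leaf size and leaf count keep the tree small
limited = tree.DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, max_leaf_nodes=6, random_state=1
).fit(X, y)

print(full.get_depth(), full.get_n_leaves())
print(limited.get_depth(), limited.get_n_leaves())
```

The constrained tree fits the training data less perfectly, which is the point: it has less room to memorize noise.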
Supplement 2: Feature engineering, trying to build a new feature
Data Science is an art that benefits from a human element. Enter feature engineering: creatively engineering your own features by combining the different existing variables.
While feature engineering is a discipline in itself, too broad to be covered here in detail, you will have a look at a simple example by creating your own new predictive attribute: family_size
# Create train_two with the newly defined feature
train_two = train.copy()
train_two["family_size"] = train_two["SibSp"] + train_two["Parch"] + 1
print(train_two["family_size"])
# Create a new feature set and add the new feature
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]].values
# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three,target)
# Print the score of this decision tree
print(my_tree_three.score(features_three, target))
print(my_tree_three.feature_importances_)
Supplement 3: Using a new model algorithm, random forest
A detailed study of Random Forests would take this tutorial a bit too far. However, since it's an often used machine learning technique, gaining a general understanding in Python won't hurt.
In layman's terms, the Random Forest technique handles the overfitting problem you faced with decision trees. It grows multiple (very deep) classification trees using the training set. At the time of prediction, each tree is used to come up with a prediction and every outcome is counted as a vote. For example, if you have trained 3 trees and 2 of them say a passenger in the test set will survive while 1 says he will not, the passenger will be classified as a survivor. This approach of overtraining trees, but having the majority's vote count as the actual classification decision, reduces overfitting.
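In other words, a random forest grows many decision trees, each tree votes on the final prediction, and the outcome with the most votes wins, which reduces overfitting.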
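The voting step itself is tiny. Here is a sketch of it in plain Python, using the three-tree example from the text (`majority_vote` is a name I made up):

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label among the per-tree predictions."""
    return Counter(votes).most_common(1)[0][0]

# Three trees: two predict survival (1), one predicts death (0)
tree_votes = [1, 1, 0]
print(majority_vote(tree_votes))  # 1
```

Note that sklearn's RandomForestClassifier averages the class probabilities predicted by the trees rather than counting hard votes, so this shows the idea rather than the library's exact procedure.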
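Advantages:
a. It performs well across datasets; the two sources of randomness (bootstrap sampling of the rows and a random subset of features at each split) make a random forest less prone to overfitting.
b. On many current datasets it has a clear edge over other algorithms; the same two sources of randomness also make it robust to noise.
c. It can handle very high-dimensional data (many features) without prior feature selection, and it adapts well to different datasets: it handles both discrete and continuous data, and the data does not need to be normalized.
d. It can produce a proximity matrix Proximities = (pij) that measures the similarity between samples: pij = aij/N, where aij is the number of times samples i and j land in the same leaf node and N is the number of trees in the forest.
e. While the forest is being built, it yields an unbiased estimate of the generalization error (the out-of-bag estimate).
f. Training is fast, and it gives a ranking of variable importance (two kinds: based on the increase in OOB misclassification rate, and based on the decrease in Gini impurity at the splits).
g. During training it can detect interactions between features.
h. It is easy to parallelize.
i. It is relatively simple to implement.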
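Points e and f can be tried directly in sklearn. The sketch below fits a forest on synthetic data where, by construction, only the first feature matters; the data and settings are my own assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(2)
X = rng.rand(300, 4)
y = (X[:, 0] > 0.5).astype(int)   # only feature 0 carries signal

# oob_score=True scores each sample using only the trees that did not see it
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=2)
forest.fit(X, y)

print(forest.oob_score_)            # out-of-bag accuracy estimate
print(forest.feature_importances_)  # impurity-based importances, summing to 1
```

`feature_importances_` here is the Gini-based kind. For point h, RandomForestClassifier also accepts an `n_jobs` argument to build the trees in parallel.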
# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier
# We want the Pclass, Age, Sex, Fare,SibSp, Parch, and Embarked variables
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
# Building and fitting my_forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 100, random_state = 1)
my_forest = forest.fit(features_forest, target)
# Print the score of the fitted random forest
print(my_forest.score(features_forest, target))
# Compute predictions on our test set features (test must be cleaned the same way as train), then print the length of the prediction vector
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))
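Summary:
I walked through the whole data analysis workflow once. Because the data quality is fairly high and the scenario is relatively simple, the models scored well and I did not run into any real problems.
Also, the modeling and training here relied entirely on sklearn by calling its methods directly. If time allows, it would be better to implement the model building in Python myself, to understand decision trees more deeply.
Finally, I would like to learn some data visualization next, for example using matplotlib to get a direct, intuitive view of the data.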
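As a first step in that direction, here is a minimal matplotlib sketch that draws the survival rates by sex as a bar chart. The two rates are copied from the sex breakdown output earlier; the output file name is an arbitrary choice of mine.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Survival rates copied from the value_counts(normalize=True) output by sex
rates = {"male": 0.188908, "female": 0.742038}

fig, ax = plt.subplots()
bars = ax.bar(list(rates.keys()), list(rates.values()))
ax.set_ylabel("Survival rate")
ax.set_ylim(0, 1)
ax.set_title("Titanic training set: survival rate by sex")

# Hypothetical output file name
fig.savefig("survival_by_sex.png")
```

In a notebook the `matplotlib.use("Agg")` line can be dropped and `plt.show()` used instead of saving to a file.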