Ensemble Model
For classification problems, ensemble models are very effective; compare the case of image recognition via deep learning, which also works well but is a black box.
For a scoring system, we typically use GBDT, XGBoost, etc.
In engineering, interpretability is very important, because it lets us pinpoint the cause once an issue arises.
How do we build an ensemble model? The two main frameworks are bagging and boosting (compared in the sketch below):
Bagging: Random forest
Boosting: GBDT, XGBoost
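Both frameworks are available in scikit-learn. Here is a minimal sketch (on a synthetic dataset, an assumption for illustration, not any dataset from this article) that trains one model of each family and compares their cross-validated accuracy:

# Minimal sketch: bagging vs. boosting in scikit-learn,
# on a synthetic classification dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# bagging: many trees trained independently on bootstrap samples
bagging = RandomForestClassifier(n_estimators=100, random_state=0)
# boosting: trees trained sequentially, each one correcting its predecessors
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging (random forest)", bagging),
                    ("boosting (GBDT)", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print("%s: mean accuracy %.3f" % (name, scores.mean()))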
We average the predictions from all of the models to obtain the ensemble prediction.
We use the variance / standard deviation of the predictions to evaluate the stability of the model.
Because averaging reduces variance, the ensemble becomes more stable than any single model; the sketch below illustrates this numerically.
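A minimal numerical sketch of this variance reduction, with the individual "models" simulated as a true value plus Gaussian noise (an assumption for illustration only):

# Minimal sketch: averaging independent noisy predictions reduces variance.
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0
n_trials, n_models = 10000, 10

# each row is one trial; each column is one simulated model's prediction
predictions = true_value + rng.normal(0.0, 2.0, size=(n_trials, n_models))

print("std of a single model:       %.3f" % predictions[:, 0].std())
print("std of the 10-model average: %.3f" % predictions.mean(axis=1).std())
# for independent models the standard deviation shrinks by about sqrt(10)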
Random Forest
Bagging is a framework for building ensemble models.
A random forest uses multiple decision trees and combines their outputs (majority vote) for the final prediction.
It can also be used for regression problems, where the forest returns the mean of the per-tree predictions, as the sketch below shows.
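A minimal sketch of the regression case, on a synthetic dataset (an assumption for illustration): in scikit-learn the forest's prediction is exactly the mean of the individual trees' predictions.

# Minimal sketch: random forest regression returns the mean of the trees.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# compare the forest's prediction with the mean of the per-tree predictions
per_tree = np.stack([t.predict(X[:5]) for t in reg.estimators_])
print(np.allclose(per_tree.mean(axis=0), reg.predict(X[:5])))  # True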
Building the random forest
If the decision trees we train are highly correlated, the performance of the random forest will not be good.
Diversity is the most important property of a random forest, and it is created by two kinds of randomization (sketched in code after this list):
1) Randomize the training samples: each decision tree is trained on a different subset of the training data, drawn by sampling with replacement (bootstrap sampling).
2) Randomize the features: for example, if we have 100 features, we randomly choose 10 of them and build the decision tree using only those 10 features.
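A minimal sketch of both randomizations in raw NumPy (illustrative only; scikit-learn performs them internally, and it re-draws the feature subset at every split rather than once per tree as described above):

# Minimal sketch: the two sources of randomness in a random forest.
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 1000, 100

# 1) bootstrap: sample the training rows with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# 2) feature randomization: pick a subset of the columns without replacement,
#    e.g. 10 out of 100 features for this particular tree
feature_idx = rng.choice(n_features, size=10, replace=False)

print("unique rows in the bootstrap sample: %d / %d"
      % (np.unique(bootstrap_idx).size, n_samples))  # roughly 63% of the rows
print("features used by this tree:", np.sort(feature_idx))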
Overfitting of the random forest
Overfitting is controlled mainly through the hyperparameters of the random forest (a tuning sketch follows the list):
n_estimators: the number of decision trees used. The more trees, the longer the training time of the random forest.
criterion: how to choose the feature for the current node, i.e., how to measure the quality of a split: 'gini' or 'entropy'.
max_depth: the maximum depth of each decision tree.
min_samples_split, min_samples_leaf: the minimum number of samples required to split a node and to form a leaf; both limit how small the leaves can get.
max_features: the number of features to consider when looking for the best split
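Since these hyperparameters are the main lever against overfitting, they are usually tuned with cross-validation. A minimal sketch on a synthetic dataset (an assumption for illustration, not the demo data below):

# Minimal sketch: tuning random forest hyperparameters with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    'n_estimators': [100, 400],
    'max_depth': [5, 10, None],
    'min_samples_leaf': [1, 3, 5],
    'max_features': ['sqrt', 'log2'],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV accuracy: %.3f" % search.best_score_)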
An example:
# import the data set
from sklearn.datasets import load_digits
# import random forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# import data
digits = load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# create the random forest classifier
clf = RandomForestClassifier(n_estimators=400, criterion='entropy', max_depth=5,
                             min_samples_split=3, max_features='sqrt', random_state=0)
clf.fit(X_train, y_train)
print("Accuracy in train data set is: %.2f, in the test data set is %.2f"
      % (clf.score(X_train, y_train), clf.score(X_test, y_test)))
output:
Accuracy in train data set is: 0.98, in the test data set is 0.95
Another Demo:
predicting the employee turnover rate
# Turnover rate demo
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
# %matplotlib inline
from sklearn.model_selection import train_test_split

# read the data as a pandas DataFrame
df = pd.read_csv('HR_comma_sep.csv', index_col=None)
# check whether any data is missing
print(df.isnull().any(), '\n\n')
# print some data
print(df.head(), "\n\n")
   satisfaction_level  last_evaluation  number_project  average_montly_hours  \
0                0.38             0.53               2                   157
1                0.80             0.86               5                   262
2                0.11             0.88               7                   272
3                0.72             0.87               5                   223
4                0.37             0.52               2                   159

   time_spend_company  Work_accident  left  promotion_last_5years  sales  \
0                   3              0     1                      0  sales
1                   6              0     1                      0  sales
2                   4              0     1                      0  sales
3                   5              0     1                      0  sales
4                   3              0     1                      0  sales

   salary
0     low
1  medium
2  medium
3     low
4     low
# rename the columns
df = df.rename(columns={
    'satisfaction_level': 'satisfaction',
    'last_evaluation': 'evaluation',
    'number_project': 'projectCount',
    'average_montly_hours': 'averageMonthlyHours',
    'time_spend_company': 'yearsAtCompany',
    'Work_accident': 'workAccident',
    'promotion_last_5years': 'promotion',
    'sales': 'department',
    'left': 'turnover'
})

# move the label to the first column
front = df['turnover']
df.drop(labels=['turnover'], axis=1, inplace=True)
df.insert(0, 'turnover', front)
# df.head()

# calculate the turnover rate
turnover_rate = df.turnover.value_counts() / len(df)
print("the turnover rate is: %.2f\n\n" % turnover_rate[1])

# print the describe() info
print(df.describe(), "\n\n")
           turnover  satisfaction    evaluation  projectCount  \
count  12504.000000  12504.000000  12504.000000  12504.000000
mean       0.200256      0.621834      0.716446      3.803503
std        0.400208      0.245010      0.169745      1.196592
min        0.000000      0.090000      0.360000      2.000000
25%        0.000000      0.450000      0.560000      3.000000
50%        0.000000      0.650000      0.720000      4.000000
75%        0.000000      0.820000      0.870000      5.000000
max        1.000000      1.000000      1.000000      7.000000

       averageMonthlyHours  yearsAtCompany  workAccident     promotion
count         12504.000000    12504.000000  12504.000000  12504.000000
mean            200.721769        3.385717      0.149472      0.016555
std              49.341169        1.321437      0.356568      0.127601
min              96.000000        2.000000      0.000000      0.000000
25%             157.000000        3.000000      0.000000      0.000000
50%             200.000000        3.000000      0.000000      0.000000
75%             244.000000        4.000000      0.000000      0.000000
max             310.000000       10.000000      1.000000      1.000000
# convert the string values into integer codes
df['department'] = df['department'].astype('category').cat.codes
df['salary'] = df['salary'].astype('category').cat.codes

# split the train / test data set
target_name = 'turnover'
X = df.drop('turnover', axis=1)
y = df[target_name]
# stratify=y keeps the turnover rate in each split equal to the rate in the full dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15,
                                                    random_state=123, stratify=y)

# now, time to train the models
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# train the decision tree
dtree = tree.DecisionTreeClassifier(
    criterion='entropy',
    # max_depth=3,  # constrain the depth of the tree to prevent overfitting
    min_weight_fraction_leaf=0.01  # each leaf must hold at least 1% of the sample weight
)
dtree = dtree.fit(X_train, y_train)
print("\n\n ---Decision Tree---")
print(classification_report(y_test, dtree.predict(X_test)))
 ---Decision Tree---
              precision    recall  f1-score   support

           0       0.97      0.98      0.98      1500
           1       0.93      0.89      0.91       376

    accuracy                           0.96      1876
   macro avg       0.95      0.94      0.94      1876
weighted avg       0.96      0.96      0.96      1876
# train the random forest
rf = RandomForestClassifier(
    criterion='entropy',
    n_estimators=1000,
    max_depth=None,        # None means the trees grow with no depth limit
    min_samples_split=10,  # minimum number of samples required to split an internal node
    # min_weight_fraction_leaf=0.02  # minimum fraction of samples per leaf, to prevent overfitting
)
rf.fit(X_train, y_train)
print("\n\n ---Random Forest---")
print(classification_report(y_test, rf.predict(X_test)))
 ---Random Forest---
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1500
           1       0.99      0.90      0.94       376

    accuracy                           0.98      1876
   macro avg       0.98      0.95      0.96      1876
weighted avg       0.98      0.98      0.98      1876