
PPDai (拍拍贷) Risk-Control Prediction Competition Write-up

Preface

A while ago I took part in a risk-control prediction competition run by PPDai: from roughly 400 data dimensions per user on average, evaluate the user's current credit status and assign each borrower a credit score. Details of the competition can be found at the PPDai Magic Mirror Cup (拍拍贷魔镜杯). I finished 46th in the preliminary round and 23rd in the final results of the second round. During the competition, project pressure at work left me short of time, so I did not analyze the data in any depth; I only did some basic processing of the raw fields and leaned towards tuning the model instead. Compared with the detailed raw-data analysis done by some of the strongest competitors, there is still a very large gap. In future competitions I will try to spend more time on data processing; it is not my strong suit, but that is where the time has to go if I want good results.

All the code is open-sourced on my GitHub (拍拍贷代码). There you can see the basic field processing, the way the features are generated, and how xgboost is wrapped so that sklearn's parameter-search tools can be applied to it.

Competition Data Analysis

The competition data consists of three parts: master, log_info, and userupdate_info.

Master: each row is one sample (a successfully funded loan), and each sample has 200-odd fields of various kinds.
  • idx: the unique key of each loan, matching the idx in the other two files.
  • UserInfo_*: borrower feature fields
  • WeblogInfo_*: web-behavior fields
  • Education_Info*: education fields
  • ThirdParty_Info_PeriodN_*: third-party data fields for time period N
  • SocialNetwork_*: social-network fields
  • ListingInfo: date the loan was funded
  • Target: default label (1 = loan defaulted, 0 = repaid normally). The test set does not contain the target field.

Log_info: the borrower's login records.
  • ListingInfo: date the loan was funded
  • LogInfo1: operation code
  • LogInfo2: operation category
  • LogInfo3: login time
  • idx: the unique key of each loan

Userupdate_info: the borrower's information-update records.
  • ListingInfo1: date the loan was funded
  • UserupdateInfo1: content of the change
  • UserupdateInfo2: time of the change
  • idx: the unique key of each loan
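The three tables are tied together by the loan key (named Idx in the CSV files, as used in the code below), so per-loan aggregates built from Log_info and Userupdate_info can be merged back onto Master. A minimal sketch of that relationship, with hypothetical file names:

import pandas as pd

# hypothetical file names; the competition CSVs are gb18030-encoded
master = pd.read_csv('Master.csv', encoding='gb18030')
log_info = pd.read_csv('Log_info.csv', encoding='gb18030')

# Master has one row per loan; Log_info has many rows per loan,
# so it is aggregated by Idx before being merged back
login_count = log_info.groupby('Idx').size().rename('login_count').reset_index()
master = master.merge(login_count, on='Idx', how='left')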

Feature Processing

clean_data.py:
  • Categorical variables other than UserInfo_2 and UserInfo_4 are factorized directly; with tree-based models there is no need for dummy encoding.
  • UserInfo_2 and UserInfo_4 hold city names, so they are not factorized independently; the union of the cities in the two columns is taken and mapped to a single shared encoding.
  • City latitude/longitude data is pulled from Baidu so that each city can be looked up by its coordinates, which helps with patterns that are correlated with geography.
  • A field UserInfo_2_4_01 is added: 0 when UserInfo_2 equals UserInfo_4, 1 otherwise.
  • city_ratio.py produces, for the cities in UserInfo_2 and in UserInfo_4, the count and ratio of samples with target = 0, and these are merged in as features.

'''
Coding Just for Fun
Created by burness on 16/3/18.
'''
import pandas as pd
import numpy as np
from env_variable import *
def clean_data(train_master_file, test_master_file, debug, save_data, city_common):
    if debug:
        train_master_data = pd.read_csv(second_train_master_file, nrows = 500,encoding='gb18030')
        train_master_data['tag'] = 1
        test_master_data = pd.read_csv(second_test_master_file, nrows = 500,encoding='gb18030')
        test_master_data['tag'] = 0
        # target =2
        test_master_data['target'] = 2
        all_master_data = train_master_data.append(test_master_data)
    else:
        train_master_data = pd.read_csv(second_train_master_file,encoding='gb18030')
        train_master_data['tag'] = 1
        test_master_data = pd.read_csv(second_test_master_file,encoding='gb18030')
        test_master_data['tag'] = 0
        test_master_data['target'] = 2
        all_master_data = train_master_data.append(test_master_data)
    # find the category columns
    category_list = ["UserInfo_2", "UserInfo_4", "UserInfo_7", "UserInfo_8", "UserInfo_19", "UserInfo_20", "UserInfo_1", \
                     "UserInfo_3", "UserInfo_5", "UserInfo_6", "UserInfo_9", "UserInfo_2", "UserInfo_4", \
                     "UserInfo_7", "UserInfo_8", "UserInfo_19", "UserInfo_20", "UserInfo_11", "UserInfo_12", "UserInfo_13", \
                     "UserInfo_14", "UserInfo_15", "UserInfo_16", "UserInfo_18", "UserInfo_21", "UserInfo_22", "UserInfo_23", \
                     "UserInfo_24", "Education_Info1", "Education_Info2", "Education_Info3", "Education_Info4", \
                     "Education_Info5", "Education_Info6", "Education_Info7", "Education_Info8", "WeblogInfo_19", \
                     "WeblogInfo_20", "WeblogInfo_21", "SocialNetwork_1", "SocialNetwork_2", "SocialNetwork_7", \
                     "ListingInfo", "SocialNetwork_12"]
    # UserInfo_2 and UserInfo_4 are city columns: map them to a shared encoding and later add a feature for whether the two are equal
    city_category_list = ["UserInfo_2", "UserInfo_4"]
    user_info_2 = all_master_data['UserInfo_2'].unique()
    user_info_4 = all_master_data['UserInfo_4'].unique()
    ret_list = list(set(user_info_2).union(set(user_info_4)))
    ret_list_dict = dict(zip(ret_list,range(len(ret_list))))
    print ret_list_dict


    # print all_master_data[['UserInfo_2','UserInfo_4']].head()
    all_master_data['UserInfo_2'] = all_master_data['UserInfo_2'].map(ret_list_dict)
    all_master_data['UserInfo_4'] = all_master_data['UserInfo_4'].map(ret_list_dict)

    for col in category_list:
        if city_common:
            if col in city_category_list:
                continue
            else:
                all_master_data[col] = pd.factorize(all_master_data[col])[0]
        else:
            all_master_data[col] = pd.factorize(all_master_data[col])[0]

    print all_master_data.shape
    city_lat_pd = pd.read_csv(city_geo_info_file,encoding='gb18030')
    # print city_lat_pd.head(200)
    city_lat_pd['UserInfo_2'] = city_lat_pd['city'].map(ret_list_dict)
    city_lat_pd = city_lat_pd.drop('city',axis=1)
    all_master_data = all_master_data.merge(city_lat_pd,on='UserInfo_2',how='left')
    city_lat_pd['UserInfo_4'] = city_lat_pd['UserInfo_2']
    city_lat_pd = city_lat_pd.drop('UserInfo_2',axis=1)
    all_master_data = all_master_data.merge(city_lat_pd,on='UserInfo_4',how='left')
    print all_master_data.shape

    # add a feature whether the UserInfo_2 and UserInfo_4 are equal
    def is_equal(x):
        if x['UserInfo_2'] == x['UserInfo_4']:
            x['UserInfo_2_4_01'] = 0
        else:
            x['UserInfo_2_4_01'] = 1
        return x['UserInfo_2_4_01']

    all_master_data['UserInfo_2_4_01'] = all_master_data.apply(is_equal, axis=1)
    # print all_master_data[['UserInfo_2_4_01','UserInfo_2','UserInfo_4']].head()
    print all_master_data.shape
    # add the ratio / count / all_count features for UserInfo_2 and UserInfo_4
    userinfo_2_ratio_pd = pd.read_csv(second_userinfo_2_ratio)
    userinfo_4_ratio_pd = pd.read_csv(second_userinfo_4_ratio)
    print userinfo_2_ratio_pd.shape
    print userinfo_4_ratio_pd.shape
    # merge the userinfo_2_ratio_pd and userinfo_4_ratio_pd
    all_master_data = all_master_data.merge(userinfo_2_ratio_pd, on='UserInfo_2', how='left')
    all_master_data = all_master_data.merge(userinfo_4_ratio_pd, on='UserInfo_4', how='left')

    print all_master_data.shape


    # save the factorize
    if save_data:
        all_master_data.to_csv(second_save_master_factorize_file,index=None)
    # print all_master_data.shape
    # clean the -1
    all_master_data = all_master_data.replace(-1,np.nan)
    print all_master_data.shape
    if save_data:
        # all_master_data.to_csv(save_master_factorize_file_nan,index=None)
        all_master_data.to_csv(second_save_master_factorizeV2_file_nan,index=None)

    # # dummies
    # for col in category_list:
    #     temp = pd.get_dummies(all_master_data[col],prefix=col)
    #     all_master_data = pd.concat([all_master_data,temp], axis=1)
    # print all_master_data.shape
    # if save_data:
        # all_master_data.to_csv(save_master_factorize_file_nan_dummies,index=None)
        # all_master_data.to_csv(save_master_factorizeV2_file_nan_dummies,index=None)


if __name__ == '__main__':
    clean_data(second_train_master_file, second_test_master_file, debug = False, save_data = True, city_common = True)
    # clean_data(train_master_file,test_master_file,debug = False, save_data = True, city_common = True)

create_features.py adds features from the log and user-update data:
  • the number of logins, their frequency, and the time span they cover;
  • the number of times the user changed their information;
  • about 55 binary indicator variables marking which fields were changed (for example the QQ number or car ownership), with a 1 in the corresponding position.

#-*-coding:utf-8-*-
'''
Coding Just for Fun
Created by burness on 16/3/19.
'''
import pandas as pd
from env_variable import *

# the train log should be read here; the original read the test log twice,
# which looks like a copy-paste slip (second_train_log_info_name is assumed
# to be defined in env_variable alongside second_test_log_info_name)
train_log_file_pd = pd.read_csv(second_train_log_info_name, encoding='gb18030')
test_log_file_pd = pd.read_csv(second_test_log_info_name, encoding='gb18030')
all_log_info_pd = train_log_file_pd.append(test_log_file_pd)
# # all_log_info_name = '../data/all/log_info.csv'
# print all_log_info_pd.shape
#
# all_log_info_pd = pd.read_csv(all_log_info_name, encoding='gb18030')
all_log_info_pd['diff_days'] = all_log_info_pd['Listinginfo1'].astype('datetime64') - all_log_info_pd['LogInfo3'].astype('datetime64')
all_log_info_pd['diff_days'] = all_log_info_pd['diff_days'].astype(str).str.replace(' days 00:00:00.000000000','').astype(int)


all_log_info_pd['LogInfo1'] = all_log_info_pd['LogInfo1'].astype(str)
all_log_info_pd['LogInfo2'] = all_log_info_pd['LogInfo2'].astype(str)
all_log_info_pd['LogInfo1_2'] = all_log_info_pd[['LogInfo1','LogInfo2']].apply(lambda x: ','.join(x),axis=1)
# groupby Idx LogInfo1_2
all_log_info_final_pd = pd.DataFrame()

log_count = all_log_info_pd.groupby('Idx')['LogInfo1_2'].count()
diff_min = all_log_info_pd.groupby('Idx')['diff_days'].min()
diff_max = all_log_info_pd.groupby('Idx')['diff_days'].max()


freq = log_count/(1.0+diff_max-diff_min)

all_log_info_final_pd['count'] = log_count
all_log_info_final_pd['freq'] = freq
all_log_info_final_pd['diff_min'] = diff_min
all_log_info_final_pd['diff_max'] = diff_max
all_log_info_final_pd['period'] = diff_max-diff_min

# print all_log_info_final_pd.reset_index().head()
all_log_info_final_pd = all_log_info_final_pd.reset_index()
# print all_log_info_final_pd.reset_index().head()
all_log_info_final_pd.to_csv(second_all_log_info_file,index=None, encoding='gb18030')




user_update_info_train_pd = pd.read_csv(second_train_update_log_file_name,encoding='gb18030')
user_update_info_test_pd = pd.read_csv(second_test_update_log_file_name, encoding='gb18030')
# user_update_info_train_pd['tag'] = 1
# user_update_info_test_pd['tag'] = 0
user_update_info_all_pd = user_update_info_train_pd.append(user_update_info_test_pd)
user_update_info_all_pd['UserupdateInfo1'] = user_update_info_all_pd['UserupdateInfo1'].str.lower()
user_update_info_all_pd['UserupdateInfo1'] = pd.factorize(user_update_info_all_pd['UserupdateInfo1'])[0]
# print user_update_info_all_pd.head()
update_count = user_update_info_all_pd.groupby('Idx')['UserupdateInfo1'].count()
# print update_count.head()
user_update_info_all_pd['update_diff_days'] = user_update_info_all_pd['ListingInfo1'].astype('datetime64')-user_update_info_all_pd['UserupdateInfo2'].astype('datetime64')
user_update_info_all_pd['update_diff_days'] = user_update_info_all_pd['update_diff_days'].astype(str).str.replace(' days 00:00:00.000000000','').astype(int)
# print user_update_info_all_pd.head()
update_diff_min = user_update_info_all_pd.groupby('Idx')['update_diff_days'].min()
update_diff_max = user_update_info_all_pd.groupby('Idx')['update_diff_days'].max()
update_freq = update_count/(1.0+update_diff_max-update_diff_min)

all_update_info_final_pd = pd.DataFrame()
all_update_info_final_pd['update_count'] = update_count
all_update_info_final_pd['update_diff_min'] = update_diff_min
all_update_info_final_pd['update_diff_max'] = update_diff_max
all_update_info_final_pd['update_freq'] = update_freq
all_update_info_final_pd['update_period'] = update_diff_max-update_diff_min
all_update_info_final_pd = all_update_info_final_pd.reset_index()
# add indicator columns marking which fields were changed
# update_info_list = user_update_info_all_pd['UserupdateInfo1'].unique()
print user_update_info_all_pd.shape
Idx_userupdateInfo1 = user_update_info_all_pd[['Idx','UserupdateInfo1']].drop_duplicates()
print Idx_userupdateInfo1.shape
Idx_userupdateInfo1_pd = Idx_userupdateInfo1.assign(c = 1).set_index(["Idx", "UserupdateInfo1"]).unstack("UserupdateInfo1").fillna(0)
Idx_userupdateInfo1_pd = Idx_userupdateInfo1_pd.reset_index()
# print len(update_info_list)
print Idx_userupdateInfo1_pd.shape
print Idx_userupdateInfo1_pd.head(5)
Idx_userupdateInfo1_pd.to_csv(second_userupdate_file,index=False)
# do some post-processing in the saved file, then re-read it without a header
Idx_userupdateInfo1_pd = pd.read_csv(second_userupdate_file,header=None)
columns_list = ['Idx']
for i in range(55):
    temp_str = 'user_update_'+str(i)
    columns_list.append(temp_str)
Idx_userupdateInfo1_pd.columns = columns_list

print all_update_info_final_pd.shape
all_update_info_final_pd.to_csv(all_update_info_file,index=None, encoding='gb18030')
#
#
# # merge the log info and update info to the save_master_factorizeV2_file_nan
all_master_facotrizeV2_file_nan = pd.read_csv(second_save_master_factorizeV2_file_nan, encoding='gb18030')
print all_master_facotrizeV2_file_nan.shape
all_master_facotrizeV2_file_nan_log = all_master_facotrizeV2_file_nan.merge(all_log_info_final_pd, on='Idx',how='left')
all_master_facotrizeV2_file_nan_log_update = all_master_facotrizeV2_file_nan_log.merge(all_update_info_final_pd, on='Idx', how='left')
print all_master_facotrizeV2_file_nan_log_update.shape

# merge the userupdateinfor1_pd
all_master_facotrizeV2_file_nan_log_update = all_master_facotrizeV2_file_nan_log_update.merge(Idx_userupdateInfo1_pd,on='Idx',how='left')
print all_master_facotrizeV2_file_nan_log_update.shape
all_master_facotrizeV2_file_nan_log_update.to_csv(second_save_master_factorizeV2_file_nan_log_updateV2,index=None)

Model Selection

Three kinds of models were tried at first; LR and the neural network were dropped later because of their performance and the limits of my compute.

  • lr_model.py: a Logistic Regression model; the results were poor, so it was dropped;
  • network.py: a neural network wrapped with Keras; my machine was too weak to run the parameter search, so it was dropped;
  • XgboostClassifier.py: wraps xgboost as a class that sklearn pipelines accept, which makes parameter search convenient; its training cost is lower than a neural network's and it also scored well in the preliminary round, so xgboost was chosen as the model (a minimal sketch of this kind of wrapper and of the random search follows this list). Brief notes on the code:
  • xgb_model.py: looks at the corresponding cv scores to get a rough range for num_boost and eta;
  • xgb_feature_selection.py: uses xgboost's feature importance to see which features matter most, and then builds basic combination features from that group (feature_ensemble.py);
  • xgb1_20_2.py: uses RandomSearch to look for the best xgboost parameters;
  • xgb1_20_2_pos_weight.py: the same as xgb1_20_2.py, but adds scale_pos_weight to account for the imbalanced classes; in practice the gain was marginal;
  • feature_ensemble.py: takes a batch of high-importance features selected by xgb_feature_selection.py and computes combination features from them.
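To make the xgboost choice concrete, here is a minimal sketch of what a sklearn-compatible wrapper plus a random parameter search can look like. It is an illustrative re-implementation, not the actual XgboostClassifier.py / xgb1_20_2.py, and the parameter ranges, X_train and y_train are placeholders:

import numpy as np
import xgboost as xgb
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import RandomizedSearchCV

class XgbWrapper(BaseEstimator, ClassifierMixin):
    # thin sklearn-style wrapper around xgb.train, so that sklearn's
    # parameter-search utilities can clone, set params on, and refit it
    def __init__(self, eta=0.1, max_depth=6, subsample=0.8,
                 colsample_bytree=0.8, num_boost_round=200):
        self.eta = eta
        self.max_depth = max_depth
        self.subsample = subsample
        self.colsample_bytree = colsample_bytree
        self.num_boost_round = num_boost_round

    def fit(self, X, y):
        params = {'objective': 'binary:logistic', 'eval_metric': 'auc',
                  'eta': self.eta, 'max_depth': self.max_depth,
                  'subsample': self.subsample,
                  'colsample_bytree': self.colsample_bytree}
        self.classes_ = np.unique(y)
        self.booster_ = xgb.train(params, xgb.DMatrix(X, label=y),
                                  num_boost_round=self.num_boost_round)
        return self

    def predict_proba(self, X):
        p = self.booster_.predict(xgb.DMatrix(X))
        return np.vstack([1.0 - p, p]).T

    def predict(self, X):
        return (self.predict_proba(X)[:, 1] > 0.5).astype(int)

# random search over placeholder ranges, scored by AUC
param_dist = {'eta': [0.01, 0.02, 0.05, 0.1], 'max_depth': [4, 6, 8],
              'subsample': [0.7, 0.8, 0.9], 'colsample_bytree': [0.7, 0.8, 0.9]}
search = RandomizedSearchCV(XgbWrapper(), param_dist, n_iter=20,
                            scoring='roc_auc', cv=5)
# search.fit(X_train, y_train); print(search.best_params_)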

Final Training Flowchart

The flowchart boils down to two parts. The first is feature generation: features are built from the raw data and fit to xgboost to obtain feature importances, and the features with the highest importance are then combined with each other (the feature-ensemble step), which effectively increases the feature dimensionality. The second is fitting the final model on the expanded features to produce the final predictions.
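As a hedged sketch of the importance-then-combine step (not the actual feature_ensemble.py), the snippet below ranks features with a quick xgboost fit and appends pairwise products of the top-ranked ones; top_k and the product-style combination are illustrative choices:

import pandas as pd
import xgboost as xgb

def top_feature_combinations(X, y, top_k=10):
    # quick fit just to rank the features; X is assumed to be a DataFrame so
    # that the importance scores are keyed by column name
    params = {'objective': 'binary:logistic', 'eval_metric': 'auc', 'eta': 0.1}
    booster = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=100)
    scores = booster.get_score(importance_type='weight')
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]

    # pairwise products of the most important features, appended as new columns
    combos = pd.DataFrame(index=X.index)
    for i, a in enumerate(top):
        for b in top[i + 1:]:
            combos['{}_x_{}'.format(a, b)] = X[a] * X[b]
    return pd.concat([X, combos], axis=1)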

Summary

  • Feature engineering is complex and a little tedious; I am not very interested in it, and with an underpowered machine many ideas were never properly validated;
  • Parameter tuning should start with a coarse pass using xgb.cv (see the sketch after this list); tuning blindly wastes far too much time;
  • The feature-selection strategy has a big impact on the final result; it consumed far too much time, since every attempt on a single machine is slow;
  • Ensembling several models guards against the randomness of any single model and usually improves AUC;
  • I thought the data only had to be submitted by the 24th, so beforehand I was still just testing data pipelines; the final submission was an ensemble of a few hand-picked results made without waiting for the CV runs to finish, and the ensemble weights did not take the local CV scores into account.
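For the coarse xgb.cv pass mentioned above, a minimal sketch looks like the following; X_train and y_train are placeholders, and early stopping picks a reasonable number of boosting rounds for a fixed eta:

import xgboost as xgb

params = {'objective': 'binary:logistic', 'eval_metric': 'auc',
          'eta': 0.05, 'max_depth': 6}
# let cross-validation with early stopping decide how many rounds this eta needs
cv_result = xgb.cv(params, xgb.DMatrix(X_train, label=y_train),
                   num_boost_round=1000, nfold=5,
                   early_stopping_rounds=50, verbose_eval=50)
print(len(cv_result))                       # rounds kept after early stopping
print(cv_result['test-auc-mean'].iloc[-1])  # cv AUC at that round count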

Thanks to the people in the group who shared their ideas during the PPDai competition.
