This week I analyzed and studied the UCSD ExtraSensory sensor dataset. This post is a retrospective of what I tried and learned.

Let's start from the problems to solve:

1. Understand and clean the data.
2. Once the data is clean, treat it as a classification problem.
Observing the data

According to the official description, the ExtraSensory dataset was collected by Yonatan Vaizman and Katherine Ellis at UCSD through a phone app, the ExtraSensory App. It records readings from the phone's various sensors together with the user's context (body state, activity, location) at that moment. The dataset consists of 60 'csv.gz' files named [UUID].features_labels.csv.gz, where the UUID is each user's unique ID; the files are gzip-compressed.
For a first look, we open and analyze the file of a single user, UUID 1155FF54-63D3-4AB2-9863-8385D0BD0A13.
Looking at it in detail, the rows are keyed and sorted by timestamp, and there are 225 feature columns, 51 label columns, and one label_source column (which we do not use as a label for learning).
My first idea was to read the csv file and split it into three parts (timestamp, features, labels), extracting each separately.
```python
import gzip
from io import StringIO

import numpy as np

# parse_header_of_csv: separate the feature columns from the label columns
# parse_body_of_csv: split the actual feature and label values into arrays
# read_user_data: read in one user's csv file

def parse_header_of_csv(csv_str):
    headline = csv_str[:csv_str.index('\n')]
    columns = headline.split(',')
    # sanity checks on the expected layout
    assert columns[0] == 'timestamp'
    assert columns[-1] == 'label_source'
    # find where the label columns start
    for (ci, col) in enumerate(columns):
        if col.startswith('label:'):
            first_label_ind = ci
            break
    feature_names = columns[1:first_label_ind]
    label_names = columns[first_label_ind:-1]
    # strip the redundant 'label:' prefix
    for (li, label) in enumerate(label_names):
        assert label.startswith('label:')
        label_names[li] = label.replace('label:', '')
    return (feature_names, label_names)

def parse_body_of_csv(csv_str, n_features):
    full_body = np.loadtxt(StringIO(csv_str), delimiter=',', skiprows=1)
    # the primary key of the data is the timestamp
    timestamps = full_body[:, 0].astype(int)
    # the feature (sensor) columns come first
    X = full_body[:, 1:(n_features + 1)]
    # then split out all the label columns
    trinary_labels_mat = full_body[:, (n_features + 1):-1]
    M = np.isnan(trinary_labels_mat)  # which label entries are NaN
    # replace NaN entries with 0, keep the rest, and convert to boolean
    Y = np.where(M, 0, trinary_labels_mat) > 0.
    return (X, Y, M, timestamps)

# returns the feature matrix X, the label matrix Y, the missing-label
# mask M, the timestamps, and feature_names / label_names
def read_user_data(uuid):
    user_data_file = '%s.features_labels.csv.gz' % uuid
    with gzip.open(user_data_file, 'r') as fid:
        csv_str = fid.read()
        csv_str = csv_str.decode(encoding='utf-8')
    (feature_names, label_names) = parse_header_of_csv(csv_str)
    n_features = len(feature_names)
    (X, Y, M, timestamps) = parse_body_of_csv(csv_str, n_features)
    return (X, Y, M, timestamps, feature_names, label_names)
```
This code has three main functions. read_user_data(uuid) reads in one user's data, and the parsing work is factored out into two helpers: parse_header_of_csv() and parse_body_of_csv().
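For instance, loading the user analyzed in this post looks like this (a quick sketch, assuming the .csv.gz file sits in the working directory):

```python
uuid = '1155FF54-63D3-4AB2-9863-8385D0BD0A13'
(X, Y, M, timestamps, feature_names, label_names) = read_user_data(uuid)
print(X.shape)  # (n_examples, 225): the 225 sensor features
print(Y.shape)  # (n_examples, 51): the 51 context labels
```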
Note that we are on Python 3, whose IO differs from Python 2: gzip gives us bytes, so after reading we must call decode() with encoding='utf-8'.
parse_header_of_csv(csv_str): since the whole csv file is read in as one big string, we locate the first '\n' to cut out the header line and split it into column names. For robustness we assert the expected layout (timestamp first, label_source last). We then enumerate the columns to find where the labels start, slice the header into feature_names and label_names, strip the 'label:' prefix from each label name, and return (feature_names, label_names).
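To make the behavior concrete, here is a toy header (hypothetical column names, not the real 225-feature layout):

```python
# toy header: timestamp, two features, two labels, label_source
toy_csv = ('timestamp,raw_acc:mean,proc_gyro:mean,'
           'label:FIX_walking,label:SITTING,label_source\n')
features, labels = parse_header_of_csv(toy_csv)
print(features)  # ['raw_acc:mean', 'proc_gyro:mean']
print(labels)    # ['FIX_walking', 'SITTING']
```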
Cleaning and analyzing the features

parse_body_of_csv(csv_str, n_features): we read the body with numpy's loadtxt through a StringIO wrapper, then split it into features and labels.
```python
# split out all the label columns
trinary_labels_mat = full_body[:, (n_features + 1):-1]
M = np.isnan(trinary_labels_mat)  # which label entries are NaN
# wherever M is True substitute 0, elsewhere keep the original value,
# then convert everything to boolean
Y = np.where(M, 0, trinary_labels_mat) > 0.
```
This gives us the missing-label mask M, while np.where swaps the NaN entries for 0 before the boolean conversion.
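A tiny example of what M and Y end up looking like:

```python
import numpy as np

mat = np.array([[1., np.nan, 0.],
                [np.nan, 1., 1.]])
M = np.isnan(mat)             # [[False  True False]
                              #  [ True False False]]
Y = np.where(M, 0, mat) > 0.  # [[ True False False]
                              #  [False  True  True]]
```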
parse_body_of_csv then returns (X, Y, M, timestamps). To inspect the NaN distribution per feature column, we put X (and Y, for the later cleaning step) into pandas DataFrames and count the non-NaN values:

```python
import pandas as pd

X_df = pd.DataFrame(X)  # feature matrix as a DataFrame
Y_df = pd.DataFrame(Y)  # labels, used in the cleaning step below
# non-NaN count per feature column
X_df.count()
# NaN count per feature column
X_df.shape[0] - X_df.count()
```
These counts show how the NaNs are distributed across the feature columns.
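An equivalent pandas idiom gives the NaN count per column directly:

```python
# NaN count per feature column, in one call
X_df.isnull().sum()
```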
We find that some feature columns are missing in their entirety. Our cleaning rule for this: if more than 50% of a column's values are missing, drop the column.
```python
X_df.columns = feature_names
Y_df.columns = label_names  # attach the real column names to both DataFrames
cleaning_data = pd.concat([X_df, Y_df], axis=1)

# after inspecting the NaN distribution, drop any column
# whose values are more than 50% missing
def drop_col(df, col_name, cutoff=0.5):
    n = len(df)
    cnt = df[col_name].count()
    if (float(cnt) / n) < cutoff:
        df.drop(col_name, axis=1, inplace=True)

cleaning_data_heading = list(np.array(cleaning_data.columns))
for name in cleaning_data_heading:
    drop_col(cleaning_data, name, cutoff=0.5)
```
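As an aside, pandas can express the same 50% rule in a single call; a sketch that matches drop_col's cutoff (thresh is the minimum number of non-NaN values a column needs to survive):

```python
# keep only columns with at least 50% non-NaN values
thresh = int(np.ceil(0.5 * len(cleaning_data)))
cleaning_data = cleaning_data.dropna(axis=1, thresh=thresh)
```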
Processing and analyzing the labels

Following the paper, some label names are tidied up for readability:

```python
# tidy up certain label names, per the paper
def get_label_pretty_name(label):
    if label == 'FIX_walking':
        return 'Walking'
    if label == 'FIX_running':
        return 'Running'
    if label == 'LOC_main_workplace':
        return 'At main workplace'
    if label == 'OR_indoors':
        return 'Indoors'
    if label == 'OR_outside':
        return 'Outside'
    if label == 'LOC_home':
        return 'At home'
    if label == 'FIX_restaurant':
        return 'At a restaurant'
    if label == 'OR_exercise':
        return 'Exercise'
    if label == 'LOC_beach':
        return 'At the beach'
    if label == 'OR_standing':
        return 'Standing'
    if label == 'WATCHING_TV':
        return 'Watching TV'

    # generic cleanup for the remaining labels
    if label.endswith('_'):
        label = label[:-1] + ')'
    label = label.replace('__', ' (').replace('_', ' ')
    label = label[0] + label[1:].lower()
    label = label.replace('i m', 'I\'m')
    return label
```
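A few sample outputs as a quick sanity check:

```python
print(get_label_pretty_name('FIX_walking'))      # Walking
print(get_label_pretty_name('SITTING'))          # Sitting
print(get_label_pretty_name('PHONE_IN_POCKET'))  # Phone in pocket
```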
According to the dataset description, examples are sorted by timestamp and spaced roughly 60 s apart, i.e. one example per minute, so summing Y per label tells us how many minutes the user spent in each state.
```python
# minutes the user spent in each label state
n_examples_per_label = np.sum(Y, axis=0)
labels_and_counts = zip(label_names, n_examples_per_label)
label_names_fix = []
label_names_clean = []
for (label, count) in labels_and_counts:
    label_names_fix.append(get_label_pretty_name(label))
    print("%s - %d minutes" % (get_label_pretty_name(label), count))
    # keep only labels with at least 10 minutes of data
    if count >= 10:
        label_names_clean.append(label)
```
Here we filter as well: only labels with at least 10 minutes of data are kept in label_names_clean.
Following the paper, we map each feature back to the sensor it came from (including a few categories of our own).
```python
# map each feature back to the sensor that produced it
def get_sensor_names_from_features(feature_names):
    feat_sensor_names = np.array([None for feat in feature_names])
    for (fi, feat) in enumerate(feature_names):
        if feat.startswith('raw_acc'):
            feat_sensor_names[fi] = 'Acc'
        elif feat.startswith('proc_gyro'):
            feat_sensor_names[fi] = 'Gyro'
        elif feat.startswith('raw_magnet'):
            feat_sensor_names[fi] = 'Magnet'
        elif feat.startswith('watch_acceleration'):
            feat_sensor_names[fi] = 'WAcc'
        elif feat.startswith('watch_heading'):
            feat_sensor_names[fi] = 'Compass'
        elif feat.startswith('location'):
            # also catches 'location_quick_features'
            feat_sensor_names[fi] = 'Loc'
        elif feat.startswith('audio_naive'):
            feat_sensor_names[fi] = 'Aud'
        elif feat.startswith('audio_properties'):
            feat_sensor_names[fi] = 'AP'
        elif feat.startswith('discrete'):
            feat_sensor_names[fi] = 'PS'
        elif feat.startswith('lf_measurements'):
            feat_sensor_names[fi] = 'LF'
        else:
            raise ValueError("!!! Unsupported feature name: %s" % feat)
    return feat_sensor_names
```
```python
feat_sensor_names = get_sensor_names_from_features(feature_names)

# display the feature-to-sensor mapping
for (fi, feature) in enumerate(feature_names):
    # ljust(10) left-justifies the sensor name in a 10-character field
    print("%4d) %s %s" % (fi, feat_sensor_names[fi].ljust(10), feature))
```
This prints one line per feature with its assigned sensor, which we inspected to verify the mapping.
Training

For this round of study I used sklearn to train on the data. The code:
```python
import sklearn.linear_model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation

# keep only the feature columns coming from the selected sensors
def project_features_to_selected_sensors(X, feat_sensor_names, sensors_to_use):
    use_feature = np.zeros(len(feat_sensor_names), dtype=bool)  # boolean mask, one slot per feature
    for sensor in sensors_to_use:
        is_from_sensor = (feat_sensor_names == sensor)  # which features belong to this sensor
        use_feature = np.logical_or(use_feature, is_from_sensor)  # accumulate into the mask
    X = X[:, use_feature]
    return X

# standardization parameters, ignoring NaNs
def estimate_standardization_params(X_train):
    mean_vec = np.nanmean(X_train, axis=0)
    std_vec = np.nanstd(X_train, axis=0)
    return (mean_vec, std_vec)

def standardize_features(X, mean_vec, std_vec):
    X_centralized = X - mean_vec.reshape((1, -1))
    normalizers = np.where(std_vec > 0., std_vec, 1.).reshape((1, -1))
    X_standard = X_centralized / normalizers
    return X_standard

def train_model(X_train, Y_train, M_train, feat_sensor_names, label_names, sensors_to_use, target_label):
    X_train = project_features_to_selected_sensors(X_train, feat_sensor_names, sensors_to_use)
    print("== Projected the features to %d features from the sensors: %s" %
          (X_train.shape[1], ', '.join(sensors_to_use)))

    # standardize the features
    (mean_vec, std_vec) = estimate_standardization_params(X_train)
    X_train = standardize_features(X_train, mean_vec, std_vec)

    # pick the column of the target label
    label_ind = label_names.index(target_label)
    y = Y_train[:, label_ind]
    missing_label = M_train[:, label_ind]
    existing_label = np.logical_not(missing_label)

    # some examples carry no value for this label, so keep only the labeled ones
    X_train = X_train[existing_label, :]
    y = y[existing_label]

    # set the remaining NaN features to 0 so the model can digest them
    X_train[np.isnan(X_train)] = 0.

    X_train, X_test, y_train, y_test = train_test_split(X_train, y, test_size=0.3, random_state=4)

    print("== Training with %d examples. For label '%s' we have %d positive and %d negative examples." %
          (len(y_train), get_label_pretty_name(target_label), sum(y_train), sum(np.logical_not(y_train))))

    # fit a logistic regression; class_weight='balanced' compensates for label imbalance
    lr_model = sklearn.linear_model.LogisticRegression(class_weight='balanced')
    lr_model.fit(X_train, y_train)

    # bundle everything needed at test time into one dict
    model = {
        'sensors_to_use': sensors_to_use,
        'target_label': target_label,
        'mean_vec': mean_vec,
        'std_vec': std_vec,
        'lr_model': lr_model,
    }
    return model
```
Standardizing the data

```python
# standardization parameters, ignoring NaNs
def estimate_standardization_params(X_train):
    mean_vec = np.nanmean(X_train, axis=0)
    std_vec = np.nanstd(X_train, axis=0)
    return (mean_vec, std_vec)

def standardize_features(X, mean_vec, std_vec):
    X_centralized = X - mean_vec.reshape((1, -1))
    normalizers = np.where(std_vec > 0., std_vec, 1.).reshape((1, -1))
    X_standard = X_centralized / normalizers
    return X_standard
```
To help the model learn, we standardize the features. The procedure (see the toy check below):

1. Compute the mean of each feature column (np.nanmean ignores the NaNs).
2. Compute the standard deviation of each feature column, again ignoring NaNs.
3. Subtract the mean vector from every row of X to get X_centralized.
4. Where std > 0 keep it, otherwise replace it with 1, giving normalizers (this avoids dividing by zero on constant columns).
5. X_standard = X_centralized / normalizers.
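A toy check of the whole procedure, with a constant second column to exercise the divide-by-zero guard:

```python
X_toy = np.array([[1., 10.],
                  [3., 10.],
                  [5., 10.]])
(mean_vec, std_vec) = estimate_standardization_params(X_toy)
# mean_vec = [3., 10.], std_vec = [1.633, 0.]
X_std = standardize_features(X_toy, mean_vec, std_vec)
# first column becomes ~[-1.22, 0., 1.22]; the constant column stays at 0
# instead of producing a division by zero, thanks to the std -> 1 replacement
```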
```python
def train_model(X_train, Y_train, M_train, feat_sensor_names, label_names, sensors_to_use, target_label):
    X_train = project_features_to_selected_sensors(X_train, feat_sensor_names, sensors_to_use)
    print("== Projected the features to %d features from the sensors: %s" %
          (X_train.shape[1], ', '.join(sensors_to_use)))

    # standardize the features
    (mean_vec, std_vec) = estimate_standardization_params(X_train)
    X_train = standardize_features(X_train, mean_vec, std_vec)

    # pick the column of the target label
    label_ind = label_names.index(target_label)
    y = Y_train[:, label_ind]
    missing_label = M_train[:, label_ind]
    existing_label = np.logical_not(missing_label)

    # some examples carry no value for this label, so keep only the labeled ones
    X_train = X_train[existing_label, :]
    y = y[existing_label]

    # set the remaining NaN features to 0 so the model can digest them
    X_train[np.isnan(X_train)] = 0.

    X_train, X_test, y_train, y_test = train_test_split(X_train, y, test_size=0.3, random_state=4)

    print("== Training with %d examples. For label '%s' we have %d positive and %d negative examples." %
          (len(y_train), get_label_pretty_name(target_label), sum(y_train), sum(np.logical_not(y_train))))

    # fit a logistic regression; class_weight='balanced' compensates for label imbalance
    lr_model = sklearn.linear_model.LogisticRegression(class_weight='balanced')
    lr_model.fit(X_train, y_train)

    # bundle everything needed at test time into one dict
    model = {
        'sensors_to_use': sensors_to_use,
        'target_label': target_label,
        'mean_vec': mean_vec,
        'std_vec': std_vec,
        'lr_model': lr_model,
    }
    return model
```
Let's analyze the pieces of this function.
```python
# select the features coming from the chosen sensors
def project_features_to_selected_sensors(X, feat_sensor_names, sensors_to_use):
    use_feature = np.zeros(len(feat_sensor_names), dtype=bool)  # boolean mask, one slot per feature
    for sensor in sensors_to_use:
        is_from_sensor = (feat_sensor_names == sensor)  # which features belong to this sensor
        use_feature = np.logical_or(use_feature, is_from_sensor)  # accumulate into the mask
    X = X[:, use_feature]
    return X
```
This function is the most interesting one to me: it ORs together a per-sensor membership mask for each requested sensor, then uses the combined boolean mask to select the columns of X in a single indexing step.
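A toy run makes the mask visible (hypothetical three-feature layout):

```python
toy_sensor_names = np.array(['Acc', 'Gyro', 'Acc'])
X_toy = np.array([[1., 2., 3.],
                  [4., 5., 6.]])
print(project_features_to_selected_sensors(X_toy, toy_sensor_names, ['Acc']))
# [[1. 3.]
#  [4. 6.]]
```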
```python
X_train, X_test, y_train, y_test = train_test_split(X_train, y, test_size=0.3, random_state=4)
```
We split the data with sklearn's train_test_split (it used to live in sklearn.cross_validation; in current versions it sits in sklearn.model_selection). Worth noting: with the same random_state, the split comes out identical every run.
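A quick demonstration of that reproducibility:

```python
from sklearn.model_selection import train_test_split
import numpy as np

X_demo, y_demo = np.arange(10).reshape(5, 2), np.arange(5)
a1, b1, _, _ = train_test_split(X_demo, y_demo, test_size=0.4, random_state=4)
a2, b2, _, _ = train_test_split(X_demo, y_demo, test_size=0.4, random_state=4)
assert (a1 == a2).all() and (b1 == b2).all()  # identical splits
```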
```python
model = {
    'sensors_to_use': sensors_to_use,
    'target_label': target_label,
    'mean_vec': mean_vec,
    'std_vec': std_vec,
    'lr_model': lr_model,
}
return model
```
When a function has many values to return, packing them into a dict keeps things tidy: downstream code can pick fields out by name.
```python
def test_model(X_test, Y_test, M_test, timestamps, feat_sensor_names, label_names, model):
    X_test = project_features_to_selected_sensors(X_test, feat_sensor_names, model['sensors_to_use'])
    print("== Projected the features to %d features from the sensors: %s" %
          (X_test.shape[1], ', '.join(model['sensors_to_use'])))

    # standardize with the same parameters estimated on the training data
    X_test = standardize_features(X_test, model['mean_vec'], model['std_vec'])

    label_ind = label_names.index(model['target_label'])
    y = Y_test[:, label_ind]
    missing_label = M_test[:, label_ind]
    existing_label = np.logical_not(missing_label)

    X_test = X_test[existing_label, :]
    y = y[existing_label]
    timestamps = timestamps[existing_label]

    X_test[np.isnan(X_test)] = 0.

    # predict
    y_pred = model['lr_model'].predict(X_test)

    # naive accuracy (correct classification rate)
    accuracy = np.mean(y_pred == y)

    # confusion-matrix counts (cf. the book 统计学习方法):
    # TP: correct positives        FN: misses (positives predicted negative)
    # FP: false alarms             TN: correct negatives
    tp = np.sum(np.logical_and(y_pred, y))
    tn = np.sum(np.logical_and(np.logical_not(y_pred), np.logical_not(y)))
    fp = np.sum(np.logical_and(y_pred, np.logical_not(y)))
    fn = np.sum(np.logical_and(np.logical_not(y_pred), y))

    # sensitivity (= recall = true positive rate) and specificity (= true negative rate)
    sensitivity = float(tp) / (tp + fn)
    specificity = float(tn) / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2.
    precision = float(tp) / (tp + fp)
    F1 = 2 * float(tp) / (2 * tp + fp + fn)

    print("-" * 10)
    print('Accuracy:          %.2f' % accuracy)
    print('Sensitivity (TPR): %.2f' % sensitivity)
    print('Specificity (TNR): %.2f' % specificity)
    print('Balanced accuracy: %.2f' % balanced_accuracy)
    print('Precision:         %.2f' % precision)
    print('F1:                %.2f' % F1)
    print("-" * 10)

    model2 = {'accuracy': accuracy, 'sensitivity (TPR)': sensitivity,
              'specificity (TNR)': specificity, 'balanced accuracy': balanced_accuracy,
              'precision': precision, 'F1': F1}
    return model2
```
For evaluation we use the metrics above: accuracy, sensitivity, specificity, balanced accuracy, precision, and F1.
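For what it's worth, sklearn.metrics computes the same quantities, which is a handy cross-check of the hand-rolled formulas (assuming y and y_pred as inside test_model; balanced_accuracy_score needs sklearn >= 0.20):

```python
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, balanced_accuracy_score)

print(accuracy_score(y, y_pred))           # accuracy
print(recall_score(y, y_pred))             # sensitivity / TPR
print(precision_score(y, y_pred))          # precision
print(f1_score(y, y_pred))                 # F1
print(balanced_accuracy_score(y, y_pred))  # (TPR + TNR) / 2
```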
```python
sensors_to_use = ['Acc', 'Gyro', 'Magnet']
Accuracy = []
Sensitivity = []
Specificity = []
Balanced_accuracy = []
Precision = []
F1 = []
for label in label_names_clean:
    target_label = label  # e.g. the raw label 'FIX_walking'
    model = train_model(X, Y, M, feat_sensor_names, label_names, sensors_to_use, target_label)
    model2 = test_model(X, Y, M, timestamps, feat_sensor_names, label_names, model)
    Accuracy.append(model2['accuracy'])
    Sensitivity.append(model2['sensitivity (TPR)'])
    Specificity.append(model2['specificity (TNR)'])
    Balanced_accuracy.append(model2['balanced accuracy'])
    Precision.append(model2['precision'])
    F1.append(model2['F1'])
```
Finally, we run the train/test loop over every cleaned label.
Conclusion

Because the positive and negative examples in this dataset are heavily imbalanced, judging by a single metric alone is misleading. My next step is to plot the ROC curve for each label.
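A minimal sketch of that next step for one label, using the logistic-regression model's continuous scores rather than its hard predictions (assumes X_test and y prepared exactly as inside test_model):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

scores = model['lr_model'].decision_function(X_test)  # signed distances to the decision boundary
fpr, tpr, thresholds = roc_curve(y, scores)
plt.plot(fpr, tpr, label='AUC = %.2f' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], '--')  # chance diagonal
plt.xlabel('False positive rate')
plt.ylabel('True positive rate (sensitivity)')
plt.legend()
plt.show()
```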