Notes and reflections on working with the ExtraSensory dataset

The process of analyzing the ExtraSensory dataset

This week I studied and analyzed the UCSD sensor dataset (ExtraSensory). Below is a review of what I learned and tried.

Let's start from the problems we need to solve:

  • We need to understand and clean the data
  • Once the data is cleaned, we need to set up a classification task on it

Observing the data

From the official description, we know the following:
The ExtraSensory dataset was collected at UCSD by Yonatan Vaizman and Katherine Ellis, using a phone app, the ExtraSensory App. It records readings from the phone's various sensors together with the user's body state and context at the time.
The dataset consists of 60 '.csv.gz' files, named [UUID].features_labels.csv.gz, where UUID is the unique ID of each user; the files are gzip-compressed.

Opening the single file for the user with UUID 1155FF54-63D3-4AB2-9863-8385D0BD0A13 and inspecting it, we learn the following:

Looking at it in detail, the rows are sorted by timestamp, which serves as the primary key; there are 225 feature columns, 51 label columns, and one label_source column (which we do not treat as a label for learning).
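As a quick first look (a sketch of my own, assuming the file sits in the working directory), the file can also be loaded directly with pandas, which handles the gzip compression automatically:

import pandas as pd

df = pd.read_csv('1155FF54-63D3-4AB2-9863-8385D0BD0A13.features_labels.csv.gz')
print(df.shape)                 # rows x (1 timestamp + 225 features + 51 labels + 1 label_source)
print(df.columns[:5].tolist())  # peek at the first few column names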

My first idea was to read in the csv file and split it into three parts: timestamps, features, and labels, extracting each part separately.

# parse_header_of_csv: separates the feature columns from the label columns
# parse_body_of_csv: splits the actual data into a feature matrix and a label matrix
# read_user_data: reads in a single user's csv.gz file
import gzip
import numpy as np
from io import StringIO

def parse_header_of_csv(csv_str):
    headline = csv_str[:csv_str.index('\n')]
    columns = headline.split(',')

    # sanity checks on the expected layout
    assert columns[0] == 'timestamp'
    assert columns[-1] == 'label_source'

    # find where the label columns start
    for (ci, col) in enumerate(columns):
        if col.startswith('label:'):
            first_label_ind = ci
            break
        pass

    feature_names = columns[1:first_label_ind]
    label_names = columns[first_label_ind:-1]

    ## strip the redundant 'label:' prefix from the label names
    for (li, label) in enumerate(label_names):
        assert label.startswith('label:')
        label_names[li] = label.replace('label:', '')
        pass

    return (feature_names, label_names)

def parse_body_of_csv(csv_str, n_feature):

    full_body = np.loadtxt(StringIO(csv_str), delimiter=',', skiprows=1)

    # the timestamp column is the primary key
    timestamps = full_body[:, 0].astype(int)

    # split features from labels; the leading columns are the sensor features
    X = full_body[:, 1:(n_feature + 1)]

    # everything after the features (except label_source) is label data
    trinary_labels_mat = full_body[:, (n_feature + 1):-1]
    M = np.isnan(trinary_labels_mat)  # mask marking which label entries are NaN
    Y = np.where(M, 0., trinary_labels_mat) > 0.  # replace NaN with 0, keep the rest,
                                                  # then convert the values to booleans
    return (X, Y, M, timestamps)

# returns the feature matrix X, the label matrix Y, the missing-label mask M,
# the timestamps, and the feature/label names
def read_user_data(uuid):
    user_data_file = '%s.features_labels.csv.gz' % uuid

    with gzip.open(user_data_file, 'r') as fid:
        csv_str = fid.read()
        csv_str = csv_str.decode(encoding='utf-8')
        pass

    (feature_names, label_names) = parse_header_of_csv(csv_str)
    n_feature = len(feature_names)
    (X, Y, M, timestamps) = parse_body_of_csv(csv_str, n_feature)

    return (X, Y, M, timestamps, feature_names, label_names)

This code has three main functions. read_user_data(uuid) reads in a user's data, and the actual parsing is factored out into two helper functions: parse_header_of_csv() and parse_body_of_csv().

One thing worth noting: we are using Python 3 here, whose IO behaves differently from Python 2. Because we open the gzip file in binary mode, we have to decode() the bytes we read, with encoding='utf-8'.
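A small alternative, as a sketch of my own rather than part of the original code: gzip.open can open the file directly in text mode, which makes the explicit decode unnecessary.

# equivalent reading in text mode; 'rt' tells gzip to decode the bytes for us
with gzip.open(user_data_file, 'rt', encoding='utf-8') as fid:
    csv_str = fid.read()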

parse_header_of_csv(csv_str): since we read the whole csv file into a single string, we analyze it directly as text. We locate the first '\n' to cut out the header line with the column names. To be safe we add assert statements checking that the first column is timestamp and the last is label_source. We then enumerate over the columns to find where the label columns start, and enumerate again over the label names to strip the 'label:' prefix. The function returns (feature_names, label_names).

Cleaning and analyzing the features

parse_body_of_csv(csv_str, n_feature): we read the body with numpy's loadtxt, wrapping the string in a StringIO, and then split the result into features and labels.

# split off all the label data
trinary_labels_mat = full_body[:, (n_feature + 1):-1]
M = np.isnan(trinary_labels_mat)  # mask marking which label entries are NaN
Y = np.where(M, 0., trinary_labels_mat) > 0.
## wherever there is a NaN we substitute 0, otherwise keep the original value,
## then convert everything to booleans

This gives us the missing-label mask M, and np.where substitutes the NaN entries before the comparison turns Y into a boolean matrix.

return (X,Y,M,timestamps)

# look at how the NaN values are distributed within each feature column
X_df.count()
X_df.shape[0] - X_df.count()

We inspect the NaN distribution of every feature column.

We find that some feature columns are missing entirely.
Our rule for handling this: if more than 50% of the values in a feature column are missing, we drop that column.

X_df.columns = feature_names
Y_df.columns = label_names  ## give both dataframes their proper column names

cleaning_data = pd.concat([X_df, Y_df], axis=1)

## after inspecting the NaN distribution: drop any column with more than 50% NaN
def drop_col(df, col_name, cutoff=0.5):
    n = len(df)
    cnt = df[col_name].count()
    if (float(cnt) / n) < cutoff:
        df.drop(col_name, axis=1, inplace=True)


cleaning_data_heading = list(np.array(cleaning_data.columns))
for name in cleaning_data_heading:
    drop_col(cleaning_data, name, cutoff=0.5)
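A more compact alternative, sketched here as my own variant under the same 50% cutoff assumption: pandas' dropna with a thresh argument keeps only the columns that have at least that many non-NaN values.

# keep a column only if at least half of its values are present
min_non_nan = int(np.ceil(0.5 * len(cleaning_data)))
cleaning_data = cleaning_data.dropna(axis=1, thresh=min_non_nan)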

Processing and analyzing the labels

# tidy up certain label names, following the conventions used in the paper
def get_label_pretty_name(label):
    if label == 'FIX_walking':
        return 'Walking'
    if label == 'FIX_running':
        return 'Running'
    if label == 'LOC_main_workplace':
        return 'At main workplace'
    if label == 'OR_indoors':
        return 'Indoors'
    if label == 'OR_outside':
        return 'Outside'
    if label == 'LOC_home':
        return 'At home'
    if label == 'FIX_restaurant':
        return 'At a restaurant'
    if label == 'OR_exercise':
        return 'Exercise'
    if label == 'LOC_beach':
        return 'At the beach'
    if label == 'OR_standing':
        return 'Standing'
    if label == 'WATCHING_TV':
        return 'Watching TV'

    if label.endswith('_'):
        label = label[:-1] + ')'
        pass

    label = label.replace('__', ' (').replace('_', ' ')
    label = label[0] + label[1:].lower()
    label = label.replace('i m', 'I\'m')
    return label
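A couple of quick calls (my own illustration) show both the special-cased names and the generic cleanup rules:

print(get_label_pretty_name('FIX_walking'))      # 'Walking'  (special case)
print(get_label_pretty_name('LYING_DOWN'))       # 'Lying down'  (generic rules)
print(get_label_pretty_name('PHONE_IN_POCKET'))  # 'Phone in pocket'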

According to the dataset description, the rows are sorted by timestamp and each example covers about 60 seconds, i.e. roughly one example per minute. Counting the positive examples of a label therefore estimates how many minutes the user spent in that state, and we compute this for every label.

# count how many minutes the user spent in each label state
n_examples_per_label = np.sum(Y, axis=0)
labels_and_counts = zip(label_names, n_examples_per_label)

label_names_fix = []
label_names_clean = []
for (label, count) in labels_and_counts:
    label_names_fix.append(get_label_pretty_name(label))
    print("%s - %d minutes" % (get_label_pretty_name(label), count))
    if count >= 10:
        label_names_clean.append(label)
    ##label = get_label_pretty_name(label)
    pass

Here we also filter: only labels with at least 10 minutes of positive examples are kept in label_names_clean.
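To get a feel for which activities dominate, a small inspection step of my own (not in the original notes) is to list the labels sorted by their counts:

# list the most frequent labels first
sorted_counts = sorted(zip(label_names, n_examples_per_label),
                       key=lambda pair: pair[1], reverse=True)
for (label, count) in sorted_counts[:10]:
    print("%s - %d minutes" % (get_label_pretty_name(label), int(count)))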

Following the paper, we map each feature back to the sensor it came from (including a few groupings we define ourselves).

# map each feature back to the sensor it came from
def get_sensor_names_from_features(feature_names):
    feat_sensor_names = np.array([None for feat in feature_names])
    for (fi, feat) in enumerate(feature_names):
        if feat.startswith('raw_acc'):
            feat_sensor_names[fi] = 'Acc'
            pass
        elif feat.startswith('proc_gyro'):
            feat_sensor_names[fi] = 'Gyro'
            pass
        elif feat.startswith('raw_magnet'):
            feat_sensor_names[fi] = 'Magnet'
            pass
        elif feat.startswith('watch_acceleration'):
            feat_sensor_names[fi] = 'WAcc'
            pass
        elif feat.startswith('watch_heading'):
            feat_sensor_names[fi] = 'Compass'
            pass
        elif feat.startswith('location'):
            feat_sensor_names[fi] = 'Loc'
            pass
        elif feat.startswith('location_quick_features'):
            # never reached (the 'location' branch above already matches),
            # but harmless since both map to 'Loc'
            feat_sensor_names[fi] = 'Loc'
            pass
        elif feat.startswith('audio_naive'):
            feat_sensor_names[fi] = 'Aud'
            pass
        elif feat.startswith('audio_properties'):
            feat_sensor_names[fi] = 'AP'
            pass
        elif feat.startswith('discrete'):
            feat_sensor_names[fi] = 'PS'
            pass
        elif feat.startswith('lf_measurements'):
            feat_sensor_names[fi] = 'LF'
            pass
        else:
            raise ValueError("!!! Unsupported feature name: %s" % feat)

        pass

    return feat_sensor_names
# display the mapping
for (fi, feature) in enumerate(feature_names):
    print("%4d) %s %s" % (fi, feat_sensor_names[fi].ljust(10), feature))  ## ljust left-aligns the sensor name in a 10-character field
    pass

Running this prints, for each feature, its index, the sensor group it was assigned to, and the feature name.

Training

For this exercise I used sklearn to train on the data.
The code is as follows:

import sklearn.linear_model
from sklearn.neighbors import KNeighborsClassifier
from sklearn import cross_validation  # note: in scikit-learn >= 0.20 this module was replaced by sklearn.model_selection

# select the features coming from the chosen sensors as the training data
def project_features_to_selected_sensors(X, feat_sensor_names, sensors_to_use):
    use_feature = np.zeros(len(feat_sensor_names), dtype=bool)  # boolean mask, one entry per feature
    for sensor in sensors_to_use:
        is_from_sensor = (feat_sensor_names == sensor)  # which features come from this sensor
        use_feature = np.logical_or(use_feature, is_from_sensor)  ## accumulate the selection
        pass

    X = X[:, use_feature]
    return X

# standardization of the data
def estimate_standardization_params(X_train):
    mean_vec = np.nanmean(X_train, axis=0)
    std_vec = np.nanstd(X_train, axis=0)
    return (mean_vec, std_vec)

def standardize_features(X, mean_vec, std_vec):
    X_centralized = X - mean_vec.reshape((1, -1))
    normalizers = np.where(std_vec > 0., std_vec, 1.).reshape((1, -1))
    X_standard = X_centralized / normalizers
    return X_standard


def train_model(X_train, Y_train, M_train, feat_sensor_names, label_names, sensors_to_use, target_label):
    X_train = project_features_to_selected_sensors(X_train, feat_sensor_names, sensors_to_use)
    print("== Projected the features to %d features from the sensors: %s" % (X_train.shape[1], ', '.join(sensors_to_use)))

    ## standardize the features
    (mean_vec, std_vec) = estimate_standardization_params(X_train)
    X_train = standardize_features(X_train, mean_vec, std_vec)

    # select the target label to train on
    label_ind = label_names.index(target_label)
    y = Y_train[:, label_ind]
    missing_label = M_train[:, label_ind]
    existing_label = np.logical_not(missing_label)

    # some examples have no annotation for this label, so keep only the examples where it exists
    X_train = X_train[existing_label, :]
    y = y[existing_label]

    # to simplify the analysis, set any remaining NaN feature values to 0
    X_train[np.isnan(X_train)] = 0.
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X_train, y, test_size=0.3, random_state=4)

    print("== Training with %d examples. For label '%s' we have %d positive and %d negative examples." % \
          (len(y_train), get_label_pretty_name(target_label), sum(y_train), sum(np.logical_not(y_train))))

    # train
    lr_model = sklearn.linear_model.LogisticRegression(class_weight='balanced')
    lr_model.fit(X_train, y_train)

    # pack everything the model needs into a dict
    model = {
        'sensors_to_use': sensors_to_use,
        'target_label': target_label,
        'mean_vec': mean_vec,
        'std_vec': std_vec,
        'lr_model': lr_model
    }

    return model

Standardizing the data

# standardization of the data
def estimate_standardization_params(X_train):
    mean_vec = np.nanmean(X_train, axis=0)
    std_vec = np.nanstd(X_train, axis=0)
    return (mean_vec, std_vec)

def standardize_features(X, mean_vec, std_vec):
    X_centralized = X - mean_vec.reshape((1, -1))
    normalizers = np.where(std_vec > 0., std_vec, 1.).reshape((1, -1))
    X_standard = X_centralized / normalizers
    return X_standard

To make learning easier, we standardize the data.
The standardization procedure is as follows:

  • compute the mean of each feature column over all examples, ignoring NaNs
  • compute the standard deviation of each feature column, ignoring NaNs
  • subtract the means from every row of X to obtain X_centralized
  • where the std is > 0 keep it, otherwise replace it with 1, obtaining normalizers
  • divide X_centralized by normalizers to obtain X_standard
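The same centering and scaling can be done with scikit-learn's StandardScaler; the sketch below is my own, and assumes a reasonably recent scikit-learn where the scaler ignores NaNs when fitting and leaves zero-variance columns untouched.

# a minimal sketch of the same standardization with StandardScaler
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_standard = scaler.fit_transform(X_train)  # fit the mean/std on the training split only
X_test_standard = scaler.transform(X_test)        # reuse the same parameters on the test split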
def train_model(X_train, Y_train, M_train, feat_sensor_names, label_names, sensors_to_use, target_label):
    X_train = project_features_to_selected_sensors(X_train, feat_sensor_names, sensors_to_use)
    print("== Projected the features to %d features from the sensors: %s" % (X_train.shape[1], ', '.join(sensors_to_use)))

    ## standardize the features
    (mean_vec, std_vec) = estimate_standardization_params(X_train)
    X_train = standardize_features(X_train, mean_vec, std_vec)

    # select the target label to train on
    label_ind = label_names.index(target_label)
    y = Y_train[:, label_ind]
    missing_label = M_train[:, label_ind]
    existing_label = np.logical_not(missing_label)

    # some examples have no annotation for this label, so keep only the examples where it exists
    X_train = X_train[existing_label, :]
    y = y[existing_label]

    # to simplify the analysis, set any remaining NaN feature values to 0
    X_train[np.isnan(X_train)] = 0.
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X_train, y, test_size=0.3, random_state=4)

    print("== Training with %d examples. For label '%s' we have %d positive and %d negative examples." % \
          (len(y_train), get_label_pretty_name(target_label), sum(y_train), sum(np.logical_not(y_train))))

    # train
    lr_model = sklearn.linear_model.LogisticRegression(class_weight='balanced')
    lr_model.fit(X_train, y_train)

    # pack everything the model needs into a dict
    model = {
        'sensors_to_use': sensors_to_use,
        'target_label': target_label,
        'mean_vec': mean_vec,
        'std_vec': std_vec,
        'lr_model': lr_model
    }

    return model

Let's now analyze the individual functions.

# select the features coming from the chosen sensors as the training data
def project_features_to_selected_sensors(X, feat_sensor_names, sensors_to_use):
    use_feature = np.zeros(len(feat_sensor_names), dtype=bool)  # boolean mask, one entry per feature
    for sensor in sensors_to_use:
        is_from_sensor = (feat_sensor_names == sensor)  # which features come from this sensor
        use_feature = np.logical_or(use_feature, is_from_sensor)  ## accumulate the selection
        pass

    X = X[:, use_feature]
    return X

This function is the part I found most interesting: it builds a boolean mask over the feature columns by OR-ing together one mask per requested sensor, and then uses that mask to slice out only the columns belonging to the selected sensors.
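A tiny toy example of my own (the names below are made up for illustration) shows how the mask works:

# hypothetical toy data: 4 features coming from 3 sensors
feat_sensor_names_demo = np.array(['Acc', 'Gyro', 'Acc', 'Loc'])
X_demo = np.arange(8).reshape(2, 4)  # 2 examples x 4 features

X_sel = project_features_to_selected_sensors(X_demo, feat_sensor_names_demo, ['Acc', 'Gyro'])
print(X_sel.shape)  # (2, 3): only the Acc and Gyro columns survive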

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X_train, y, test_size=0.3, random_state=4)

We use sklearn's cross-validation utilities to split the dataset into a training and a test part. One thing to keep in mind is that the same random_state always produces the same split, which makes the experiment reproducible.
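A quick check of my own, using toy arrays, illustrating that a fixed random_state reproduces the split exactly:

# two calls with the same random_state yield identical splits
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10) % 2
split_a = cross_validation.train_test_split(X_toy, y_toy, test_size=0.3, random_state=4)
split_b = cross_validation.train_test_split(X_toy, y_toy, test_size=0.3, random_state=4)
print(all(np.array_equal(p, q) for p, q in zip(split_a, split_b)))  # True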

model = {
    'sensors_to_use': sensors_to_use,
    'target_label': target_label,
    'mean_vec': mean_vec,
    'std_vec': std_vec,
    'lr_model': lr_model
}

return model

When a function needs to return this many values, packing them into a dict like this is cleaner than returning a long tuple.

def test_model(X_test, Y_test, M_test, timestamps, feat_sensor_names, label_names, model):

    X_test = project_features_to_selected_sensors(X_test, feat_sensor_names, model['sensors_to_use'])
    print("== Projected the features to %d features from the sensors: %s" % (X_test.shape[1], ', '.join(model['sensors_to_use'])))

    # standardize the test data with the same parameters that were estimated on the training data
    X_test = standardize_features(X_test, model['mean_vec'], model['std_vec'])

    label_ind = label_names.index(model['target_label'])
    y = Y_test[:, label_ind]
    missing_label = M_test[:, label_ind]
    existing_label = np.logical_not(missing_label)

    X_test = X_test[existing_label, :]
    y = y[existing_label]
    timestamps = timestamps[existing_label]

    X_test[np.isnan(X_test)] = 0.
    print("== Testing with %d examples. For label '%s' we have %d positive and %d negative examples." % \
          (len(y), get_label_pretty_name(model['target_label']), sum(y), sum(np.logical_not(y))))

    # predict
    y_pred = model['lr_model'].predict(X_test)

    # naive accuracy (correct classification rate):
    accuracy = np.mean(y_pred == y)

    # (learned from the book "Statistical Learning Methods")
    # TP: true positives, correctly detected matches
    # FN: false negatives, missed matches
    # FP: false positives, spurious detections
    # TN: true negatives, correctly rejected non-matches
    tp = np.sum(np.logical_and(y_pred, y))
    tn = np.sum(np.logical_and(np.logical_not(y_pred), np.logical_not(y)))
    fp = np.sum(np.logical_and(y_pred, np.logical_not(y)))
    fn = np.sum(np.logical_and(np.logical_not(y_pred), y))

    # Sensitivity (= recall = true positive rate) and specificity (= true negative rate):
    sensitivity = float(tp) / (tp + fn)  # recall
    specificity = float(tn) / (tn + fp)

    balanced_accuracy = (sensitivity + specificity) / 2.

    precision = float(tp) / (tp + fp)  # precision
    F1 = 2 * float(tp) / (2 * tp + fp + fn)  # F1 score

    print("-" * 10)
    print('Accuracy: %.2f' % accuracy)
    print('Sensitivity (TPR): %.2f' % sensitivity)
    print('Specificity (TNR): %.2f' % specificity)
    print('Balanced accuracy: %.2f' % balanced_accuracy)
    print('Precision: %.2f' % precision)
    print('F1: %.2f' % F1)
    print("-" * 10)

    model2 = {'accuracy': accuracy, 'sensitivity (TPR)': sensitivity, 'specificity (TNR)': specificity,
              'balanced accuracy': balanced_accuracy, 'precision': precision, 'F1': F1}
    return model2

For evaluation we use the metrics listed above.
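As a sanity check of my own (not part of the original notes), the same numbers can be recomputed with scikit-learn's metric functions, using the y and y_pred arrays from inside test_model:

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, balanced_accuracy_score

print(accuracy_score(y, y_pred))           # naive accuracy
print(recall_score(y, y_pred))             # sensitivity / TPR
print(precision_score(y, y_pred))          # precision
print(f1_score(y, y_pred))                 # F1
print(balanced_accuracy_score(y, y_pred))  # balanced accuracy (available in scikit-learn >= 0.20)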

sensors_to_use = ['Acc', 'Gyro', 'Magnet']
Accuracy = []
Sensitivity = []
Specificity = []
Balanced_accuracy = []
Precision = []
F1 = []

for label in label_names_clean:
    target_label = label  # e.g. the raw label name FIX_walking
    model = train_model(X, Y, M, feat_sensor_names, label_names, sensors_to_use, target_label)
    model2 = test_model(X, Y, M, timestamps, feat_sensor_names, label_names, model)
    Accuracy.append(model2['accuracy'])
    Sensitivity.append(model2['sensitivity (TPR)'])
    Specificity.append(model2['specificity (TNR)'])
    Balanced_accuracy.append(model2['balanced accuracy'])
    Precision.append(model2['precision'])
    F1.append(model2['F1'])

We run the train/test loop over every label in label_names_clean.

Conclusion

Because the positive and negative examples of each label are quite imbalanced in this dataset, looking at any single metric in isolation is misleading. As a next step I plan to plot the ROC curve for every label.
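A rough sketch of how that next step might look; this is my own draft, it assumes matplotlib is available, and X_test_std here stands for the standardized test features of one target label (a hypothetical name, not defined in the code above):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# use the logistic model's predicted probability of the positive class as the score
scores = model['lr_model'].predict_proba(X_test_std)[:, 1]
fpr, tpr, thresholds = roc_curve(y, scores)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = %.2f' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--')  # chance level
plt.xlabel('False positive rate')
plt.ylabel('True positive rate (recall)')
plt.title(get_label_pretty_name(model['target_label']))
plt.legend()
plt.show()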