Machine Learning - Feature Engineering - Dictionary Feature Extraction

July 5, 2018 Python Machine Learning

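Machine Learning

Machine learning automatically analyzes data to obtain a model, then uses the patterns captured by that model to make predictions on unseen data

Data and features determine the upper bound of what machine learning can achieve; models and algorithms only approximate that bound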

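one-hot encoding

One-hot encoding keeps the values of a categorical feature (skin color: yellow, black, white; sex: male, female) from implying any order or priority among themselves
Each category is converted into its own 0/1 column, as in the following form, which makes analysis easier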

Sample\Skin color	Black	Yellow	White
1	0	1	0
2	1	0	0
3	0	0	1
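
As a rough illustration of the table above, the following sketch builds the same 0/1 rows in plain Python. The sample values, the column order, and the helper name `one_hot` are assumptions made for this example only; they are not part of sklearn.

```python
# Hypothetical sample data: one skin-color label per sample.
samples = ["yellow", "black", "white"]

# Fix a column order; each category gets its own 0/1 column.
categories = ["black", "yellow", "white"]


def one_hot(value, categories):
    """Return a list with 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


encoded = [one_hot(sample, categories) for sample in samples]
for row in encoded:
    print(row)
```

Because every category sits in its own column, no value is numerically "larger" than another, which is the point of the encoding.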

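Feature extraction

Feature extraction converts arbitrary data (such as text or images) into numeric feature values so that it can be used for machine learning

Converting data into feature values lets the computer make better sense of it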

sklearn feature extraction API

Use sklearn.feature_extraction

Dictionary feature extraction

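Convert dictionary data into feature values

Turns the categorical entries in a dictionary into separate features
Use cases:
- The dataset contains many categorical features (sex, the cabin a passenger traveled in…, etc.)
  1. Convert the dataset's features into dictionaries
  2. Transform them with DictVectorizer
- The data already arrives as dictionaries, so dictionary feature extraction can be applied directly
Use sklearn.feature_extraction.DictVectorizer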
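
The two-step use case above (rows to dictionaries, then DictVectorizer) might look like the sketch below. The passenger records and the column names `sex`, `cabin`, and `age` are made up for illustration and do not come from a real dataset.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical passenger records: (sex, cabin class, age).
columns = ["sex", "cabin", "age"]
rows = [
    ("male", "first", 22),
    ("female", "third", 38),
    ("female", "first", 26),
]

# Step 1: turn each row into a dict keyed by column name.
records = [dict(zip(columns, row)) for row in rows]

# Step 2: DictVectorizer one-hot encodes string values and
# passes numeric values through unchanged.
vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(records)

print(vectorizer.feature_names_)
print(matrix)
```

String-valued features become `name=value` columns, while the numeric `age` stays a single column.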

DictVectorizer(sparse=True,…)

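Instantiates a dictionary vectorizer (a transformer class)

With sparse=True (the default), the result is a sparse matrix, which saves memory and is convenient to load and process
With sparse=False, a regular dense matrix (ndarray type) is returned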

fit_transform(X)

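This method converts dictionary data into numeric feature values

X: a dict or an iterable of dicts
Returns: a sparse matrix (or an ndarray when sparse=False)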

inverse_transform(X)

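X: an array or a sparse matrix
Returns: the data in its pre-transformation format (the array is converted back to a list of dicts, but the form changes: keys become the generated feature names and values become floats)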

[{'city=Taipei': 1.0, 'temperature': 35.0}, 
 {'city=Tainan': 1.0, 'temperature': 32.0}, 
 {'city=Nantou': 1.0, 'temperature': 30.0}, 
 {'city=Chiayi': 1.0, 'temperature': 31.0}]
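
A minimal sketch of the round trip, using a shortened two-record version of the city data (the variable names are chosen for this example):

```python
from sklearn.feature_extraction import DictVectorizer

dict_data = [
    {"city": "Taipei", "temperature": 35},
    {"city": "Tainan", "temperature": 32},
]

vectorizer = DictVectorizer()
encoded = vectorizer.fit_transform(dict_data)

# inverse_transform returns one dict per row, keyed by the generated
# feature names ("city=Taipei") rather than the original keys ("city").
restored = vectorizer.inverse_transform(encoded)
print(restored)
```

Note that the restored dictionaries are not identical to the input: the `city` key is gone, and only the non-zero one-hot column survives for each row.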

get_feature_names()

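Returns the feature names, one per output column (deprecated in scikit-learn 1.0 and removed in 1.2; newer versions use get_feature_names_out() instead)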

['city=Chiayi', 
 'city=Nantou', 
 'city=Tainan', 
 'city=Taipei', 
 'temperature']
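
Since the method name depends on the installed scikit-learn version, a version-tolerant sketch could check for the newer method first. The `hasattr` fallback is a choice made for this example, not something the original post does.

```python
from sklearn.feature_extraction import DictVectorizer

dict_data = [
    {"city": "Taipei", "temperature": 35},
    {"city": "Tainan", "temperature": 32},
    {"city": "Nantou", "temperature": 30},
    {"city": "Chiayi", "temperature": 31},
]

vectorizer = DictVectorizer()
vectorizer.fit(dict_data)

# Newer scikit-learn releases expose get_feature_names_out();
# fall back to the older get_feature_names() when it is missing.
if hasattr(vectorizer, "get_feature_names_out"):
    names = list(vectorizer.get_feature_names_out())
else:
    names = list(vectorizer.get_feature_names())

print(names)
```

The position of each name in the list is the column index of that feature in the transformed matrix.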

transform(X)

Transforms X using the feature mapping learned during the earlier fit (or fit_transform) call
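
To illustrate, the sketch below fits on two cities and then transforms new records, one of which contains a city that was not seen during fitting. The new records (including "Hualien") are hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer

train = [
    {"city": "Taipei", "temperature": 35},
    {"city": "Tainan", "temperature": 32},
]

vectorizer = DictVectorizer(sparse=False)
vectorizer.fit(train)  # learns columns: city=Tainan, city=Taipei, temperature

# Hypothetical new records; "Hualien" was never seen during fit.
new_rows = [
    {"city": "Tainan", "temperature": 28},
    {"city": "Hualien", "temperature": 27},
]

# transform reuses the learned columns; unseen features are silently ignored.
encoded_new = vectorizer.transform(new_rows)
print(encoded_new)
```

The second row has zeros in both city columns, because transform never adds new columns after fitting.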

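Workflow

Instantiate the DictVectorizer class
Call the fit_transform method to feed in the data and transform it

Note that the return format is a sparse matrix by default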

Example

from sklearn.feature_extraction import DictVectorizer

dictData = [
    {"city":"Taipei","temperature":35},
    {"city":"Tainan","temperature":32},
    {"city":"Nantou","temperature":30},
    {"city":"Chiayi","temperature":31},
]

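# Dictionary feature extraction
def dictvec():

    # Instantiate the dictionary vectorizer (a transformer class);
    # named `vectorizer` rather than `dict` to avoid shadowing the built-in
    vectorizer = DictVectorizer()

    # Call fit_transform to learn the feature mapping and encode the data
    trans_data = vectorizer.fit_transform(dictData)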

    print(trans_data)

if __name__ == '__main__':
    dictvec()

sparse = True (default)

The result is returned as a sparse matrix

(0, 3)	1.0
(0, 4)	35.0
(1, 2)	1.0
(1, 4)	32.0
(2, 1)	1.0
(2, 4)	30.0
(3, 0)	1.0
(3, 4)	31.0

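A sparse matrix saves memory and makes loading the data more efficient, because only the non-zero entries are stored
The tuple at the start of each line is (row index, column index); the number after it is the value stored at that position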
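
One way to inspect those (row, column) value triples programmatically is to convert the result to SciPy's COO format, as in this sketch. The conversion through `tocoo()` is chosen here for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

dict_data = [
    {"city": "Taipei", "temperature": 35},
    {"city": "Tainan", "temperature": 32},
    {"city": "Nantou", "temperature": 30},
    {"city": "Chiayi", "temperature": 31},
]

sparse_matrix = DictVectorizer().fit_transform(dict_data)

# COO format exposes parallel arrays of row indices, column indices and values.
coo = sparse_matrix.tocoo()
entries = sorted(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))
for row, col, value in entries:
    print(f"({row}, {col}) -> {value}")

# Only the non-zero entries are stored.
total_cells = sparse_matrix.shape[0] * sparse_matrix.shape[1]
print("stored values:", sparse_matrix.nnz, "of", total_cells)
```

Here only 8 of the 20 cells are stored; the saving grows with the number of distinct categories.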

sparse = False

DictVectorizer(sparse=False) returns an ndarray (a regular dense matrix)

[[ 0.  0.  0.  1. 35.]
 [ 0.  0.  1.  0. 32.]
 [ 0.  1.  0.  0. 30.]
 [ 1.  0.  0.  0. 31.]]
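
The dense result above can be reproduced with the sketch below, which also checks that calling `toarray()` on the default sparse output gives the same values. The variable names are chosen for this example.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

dict_data = [
    {"city": "Taipei", "temperature": 35},
    {"city": "Tainan", "temperature": 32},
    {"city": "Nantou", "temperature": 30},
    {"city": "Chiayi", "temperature": 31},
]

# sparse=False makes fit_transform return a NumPy ndarray directly.
dense = DictVectorizer(sparse=False).fit_transform(dict_data)

# Alternatively, densify the default sparse output afterwards.
from_sparse = DictVectorizer(sparse=True).fit_transform(dict_data).toarray()

print(type(dense).__name__)
print(dense)
print(np.array_equal(dense, from_sparse))
```

The dense form is easier to read for small data; for data with many categories, the sparse form is usually the safer default.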