Feature Engineering for AI and Machine Learning?

4 min read · Aug 19, 2020

Hi there! In this article we'll talk about what Feature Engineering means and what its main benefits are.

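What it means: a Machine Learning algorithm learns from its Input in order to produce an Output. The Input is made up of Features, one per Column, and the algorithm has to look through them to learn the Patterns hidden underneath those Features. Feature Engineering is the work of preparing and transforming those Features so that the Patterns are easier to learn.
Benefit: it improves the performance of Machine Learning models.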

Let's get hands-on...

Install the Pandas Profiling library with pip install.

pip install pandas-profiling[notebook]

Then load the dataset, titanic.csv, and call the ProfileReport function in a Jupyter Notebook.

import

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

Load the dataset file named titanic.csv

df = pd.read_csv('titanic.csv', sep=',')

Next, display the table.

df

ProfileReport

profile = ProfileReport(df, title="Pandas Profiling Report")
profile

Use the isnull() and sum() functions, as in df.isnull().sum(), to count the Missing Values in each Column.
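To see what that count looks like without needing titanic.csv, here is a minimal sketch on a tiny made-up table. The column names and values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from titanic.csv)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# isnull() marks every missing cell as True;
# sum() then counts the True values column by column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Swapping sum() for mean() gives the share of missing cells per Column instead of the count, which is what the Threshold step later on relies on.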

Look at the first 5 rows of the table

df.head()

Imputation

From df.isnull().sum() we found that the Age Column has 177 Missing Value cells. We'll try replacing those Missing Values with the mean of all the values in the Age Column, using the fillna() function.

new_df = df.copy()
# Assign the result back instead of calling fillna(inplace=True) on a single column
new_df['Age'] = new_df['Age'].fillna(df['Age'].mean())
print(new_df.isnull().sum())

df.isnull().mean()

Drop Rows or Columns whose share of Missing Values reaches a chosen Threshold

threshold = 0.5
new_df = df[df.columns[df.isnull().mean() < threshold]]
new_df.isnull().mean()

print(df.shape)
new_df = df.loc[df.isnull().mean(axis=1) < threshold]
print(new_df.shape)

print(new_df.isnull().sum())
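The two Threshold filters above are easier to follow on a table small enough to check by hand. This is a rough sketch on made-up data; the column names a, b and c are placeholders, not columns from titanic.csv.

```python
import numpy as np
import pandas as pd

# Toy data: column b is 75% missing, row 1 is missing 2 of its 3 cells
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})
threshold = 0.5

# Column-wise: keep columns whose missing ratio is below the threshold
kept_columns = toy.loc[:, toy.isnull().mean() < threshold]

# Row-wise: axis=1 computes the missing ratio across each row instead
kept_rows = toy.loc[toy.isnull().mean(axis=1) < threshold]

print(list(kept_columns.columns))
print(list(kept_rows.index))
```

Column b and row 1 are the ones that go, because they are the only ones with at least half of their cells missing.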

Outlier values, meaning unusually small or unusually large values, distort the mean. One way around this is to replace Missing Values with the median of each Column instead.

# numeric_only=True skips text columns, which have no median
print(df.median(numeric_only=True))
new_df = df.fillna(df.median(numeric_only=True))
print(new_df.isnull().sum())
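Here is a small sketch of why the median is the safer fill value when Outliers are around. The numbers are invented for illustration, with 500 playing the role of the Outlier.

```python
import numpy as np
import pandas as pd

# Four ordinary values, one extreme value, and one missing value
values = pd.Series([20.0, 22.0, 24.0, 26.0, 500.0, np.nan])

mean_value = values.mean()      # pulled far upward by the 500
median_value = values.median()  # stays in the middle of the ordinary values

filled_mean = values.fillna(mean_value)
filled_median = values.fillna(median_value)

print("mean:", mean_value, "median:", median_value)
```

With the mean, the missing cell would be filled with a value larger than almost everything else in the Column; with the median it gets a typical value.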

Replace the Missing Value in every cell with 0

new_df = df.fillna(0)
print(new_df.isnull().sum())

new_df.head()

Drop the entire row whenever any one of its cells has a Missing Value. This is a good approach as long as it does not throw away a large amount of data.

print(df.shape)
new_df = df.dropna(how='any')
print(new_df.shape)
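If how='any' turns out to remove too many rows, one option is to limit the check to the Columns you actually care about with the subset parameter of dropna(). A minimal sketch on made-up data (the values are assumptions, only the column names echo the Titanic table):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", "E46"],
    "Fare": [7.25, 71.28, 8.05],
})

# Drop a row if ANY of its cells is missing
any_missing_dropped = toy.dropna(how="any")

# Drop a row only if Age is missing; gaps in other columns are tolerated
age_missing_dropped = toy.dropna(subset=["Age"])

print(any_missing_dropped.shape, age_missing_dropped.shape)
```

Comparing the two shapes shows how much data each choice costs before you commit to one.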

Handling Outliers

We'll do some Data Visualization to look at the distribution of the data and its Outliers, using the boxplot() function from the Seaborn library.

import seaborn as sns
from matplotlib import pyplot as plt
fig = plt.figure(figsize=(12,8))
sns.boxplot(x=df['Age'], color='blue')
plt.xlabel('Age', fontsize=14)
plt.show()

df['Age'].describe()

Drop Outlier with Standard Deviation

We'll try dropping the rows where the Age Column holds an Outlier, meaning a value more than 3 standard deviations away from the mean.

print(df.shape)
factor = 3
upper_lim = df['Age'].mean() + df['Age'].std() * factor
lower_lim = df['Age'].mean() - df['Age'].std() * factor
# Rows with a missing Age drop out too, because comparisons with NaN are False
drop_outlier1 = df[(df['Age'] < upper_lim) & (df['Age'] > lower_lim)]
print(drop_outlier1.shape)

fig = plt.figure(figsize=(12,8))
sns.boxplot(x=drop_outlier1['Age'], color='blue')
plt.xlabel('Age', fontsize=14)
plt.show()

drop_outlier1['Age'].describe()
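The same Standard Deviation rule can be wrapped in a small helper so it is easy to try on other Columns. This is only a sketch: drop_outliers_std is a hypothetical name introduced here, and the toy ages are made up.

```python
import pandas as pd

def drop_outliers_std(frame, column, factor=3):
    """Keep rows whose value lies strictly within mean +/- factor * std."""
    mean = frame[column].mean()
    std = frame[column].std()
    upper_lim = mean + std * factor
    lower_lim = mean - std * factor
    return frame[(frame[column] < upper_lim) & (frame[column] > lower_lim)]

# Toy data: twenty ordinary ages (20..39) plus one impossible value
toy = pd.DataFrame({"Age": list(range(20, 40)) + [500]})
cleaned = drop_outliers_std(toy, "Age")

print(toy.shape, cleaned.shape)
```

One thing to keep in mind is that the Outlier itself inflates both the mean and the standard deviation, so on very small samples a single extreme value may not be far enough out to cross the 3-sigma limit.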

Drop with Percentiles

We can also drop the rows whose Age is less than or equal to the 0.05 Quantile or greater than or equal to the 0.95 Quantile.

print(df.shape)
upper_lim = df['Age'].quantile(.95)
lower_lim = df['Age'].quantile(.05)
# .copy() lets us add columns to drop_outlier2 later without a SettingWithCopyWarning
drop_outlier2 = df[(df['Age'] < upper_lim) & (df['Age'] > lower_lim)].copy()
print(drop_outlier2.shape)

fig = plt.figure(figsize=(6,6))
sns.boxplot(x=drop_outlier2['Age'], color='lime')
plt.xlabel('Age', fontsize=14)
plt.show()

drop_outlier2['Age'].describe()
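To make the Percentile cut-offs concrete, here is a sketch on the made-up ages 1 to 100, where it is easy to reason about which values fall in the bottom and top 5%.

```python
import pandas as pd

# Toy data: the integers 1..100 standing in for ages
toy = pd.DataFrame({"Age": range(1, 101)})

lower_lim = toy["Age"].quantile(.05)
upper_lim = toy["Age"].quantile(.95)

# Strict comparisons: values equal to a limit are dropped as well
trimmed = toy[(toy["Age"] > lower_lim) & (toy["Age"] < upper_lim)]

print(lower_lim, upper_lim, trimmed.shape)
```

Unlike the Standard Deviation rule, this always removes roughly the same share of rows from each end, whether or not those rows are really unusual.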

Binning

Binning, i.e. splitting the data into predefined ranges, can help guard against Overfitting when training a Model, at the cost of some detail in the data.

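# Creates the Age_cat column used in the One-hot Encoding step; the bin edges and labels are illustrative assumptions
drop_outlier2['Age_cat'] = pd.cut(drop_outlier2['Age'], bins=[0, 12, 19, 59, 120], labels=['Child', 'Teen', 'Adult', 'Senior'])
drop_outlier2.sample(n=5).head()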
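As a rough sketch of how Binning behaves at the edges of each range, here is pd.cut() on a handful of made-up ages. The bin edges and the labels Child, Teen, Adult and Senior are assumptions picked for illustration, not values from the original analysis.

```python
import pandas as pd

ages = pd.Series([5, 12, 13, 30, 70], name="Age")

# Each bin is open on the left and closed on the right by default:
# (0, 12], (12, 19], (19, 59], (59, 120]
bins = [0, 12, 19, 59, 120]
labels = ["Child", "Teen", "Adult", "Senior"]

age_cat = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"Age": ages, "Age_cat": age_cat}))
```

Because the right edge is included, an age of exactly 12 lands in the first bin and 13 in the second. A value outside every bin comes back as a Missing Value, so it is worth choosing edges that cover the whole range of the data.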

Log Transform

Log Transform applies the mathematical logarithm to the data. It helps reduce skew, so after the transform the distribution tends to be closer to a Normal Distribution. Note that the logarithm is only defined for positive values.

drop_outlier2['log'] = (drop_outlier2['Age']).transform(np.log)
drop_outlier2.sample(n=5).head()
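One way to check whether a Log Transform actually helped is to compare the skewness before and after. The sketch below uses made-up, strongly right-skewed numbers rather than the Age Column.

```python
import numpy as np
import pandas as pd

# Toy data with a long right tail
raw = pd.Series([1, 2, 3, 4, 5, 10, 50, 200, 1000], dtype=float)

# Natural logarithm of every value; all values must be positive
logged = np.log(raw)

# Series.skew() is near 0 for a symmetric distribution
print("skew before:", raw.skew())
print("skew after: ", logged.skew())
```

If a Column can contain zeros, np.log1p (the logarithm of 1 + x) is a common substitute, since np.log(0) is negative infinity.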

One-hot encoding

One-hot Encoding is an encoding scheme that is used a lot in Machine Learning. It expands a single Column into several Columns of 0s and 1s, one per category found in the original Column. For each row, the new Column that corresponds to the row's category is set to 1 and all the other new Columns are set to 0.

encoded_columns = pd.get_dummies(drop_outlier2['Age_cat'])
drop_outlier2 = drop_outlier2.join(encoded_columns)
drop_outlier2.sample(n=5).head()
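Here is the same idea on four made-up rows, small enough to read the result at a glance. The category names are assumptions for illustration, and dtype=int is passed so the new Columns show 0 and 1 rather than True and False on recent pandas versions.

```python
import pandas as pd

toy = pd.DataFrame({"Age_cat": ["Child", "Adult", "Senior", "Adult"]})

# One new column per distinct category, named after the category
encoded = pd.get_dummies(toy["Age_cat"], dtype=int)

# Attach the new columns next to the original one
result = toy.join(encoded)
print(result)
```

Every row ends up with exactly one 1 across the new Columns, in the Column named after its own category.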

That's the end of this article. I hope everyone got something useful out of it... See you next time. Thank you!