Binning

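Depending on the data analysis algorithm, it can be more effective to divide continuous data into fixed intervals (bins) than to use it as is. Continuous values such as price, cost, or efficiency are expressed as discrete values that represent a level or degree, which brings out the differences between intervals.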

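The process of dividing a continuous variable into intervals and converting each interval into a categorical, discrete variable is called binning. The pandas cut() function divides continuous data into several bins and converts it into categorical data.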

For example, the 'horsepower' column from the earlier example can be divided into bins such as low output (저출력), medium output (보통출력), and high output (고출력).

np.histogram() : getting the list of bin edges

import pandas as pd
import numpy as np

df = pd.read_csv('auto-mpg.csv')

# Assign column names
df.columns = ['mpg', 'cylinders',
              'displacement', 'horsepower',
              'weight', 'acceleration',
              'model year', 'origin', 'name']

# Replace '?' with NaN, drop the rows with missing values, and convert to float
df['horsepower'] = df['horsepower'].replace('?', np.nan)
df.dropna(subset=['horsepower'], how='any', inplace=True)
df['horsepower'] = df['horsepower'].astype('float')

# Use np.histogram to get the list of edges that split the data into 3 bins
count, bin_dividers = np.histogram(df['horsepower'], bins=3)
print("count : ", count)
print("bin_dividers : \n", bin_dividers)

count :  [257 102  32]
bin_dividers :
 [ 46.         107.33333333 168.66666667 230.        ]
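When `bins` is an integer, `np.histogram` splits the range from the minimum to the maximum into equal-width intervals, so the edges can be reproduced by hand. The following is a minimal sketch on a small made-up array `hp` (not the actual dataset); only its minimum and maximum were chosen to match the column above.

```python
import numpy as np

# Hypothetical sample values; only the min (46) and max (230) mirror the real column.
hp = np.array([46.0, 88.0, 95.0, 150.0, 165.0, 198.0, 230.0])

# Let NumPy compute the counts and the bin edges for 3 bins.
count, bin_dividers = np.histogram(hp, bins=3)

# Equal-width edges by hand: (number of bins + 1) evenly spaced points from min to max.
manual_edges = np.linspace(hp.min(), hp.max(), num=3 + 1)

print(bin_dividers)
print(manual_edges)
print(np.allclose(bin_dividers, manual_edges))
```

Three bins always need four edges, which is why `bin_dividers` has one more element than `count`.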

pd.cut() : assigning each value to a bin (see pandas.cut in the pandas 1.2.5 documentation)

import pandas as pd
import numpy as np

df = pd.read_csv('auto-mpg.csv')

# Assign column names
df.columns = ['mpg', 'cylinders',
              'displacement', 'horsepower',
              'weight', 'acceleration',
              'model year', 'origin', 'name']

# Replace '?' with NaN, drop the rows with missing values, and convert to float
df['horsepower'] = df['horsepower'].replace('?', np.nan)
df.dropna(subset=['horsepower'], how='any', inplace=True)
df['horsepower'] = df['horsepower'].astype('float')

# Use np.histogram to get the list of edges that split the data into 3 bins
count, bin_dividers = np.histogram(df['horsepower'], bins=3)
print("count : ", count)
print("bin_dividers : \n", bin_dividers)

# Names for the 3 bins: low output, medium output, high output
bin_names = ['저출력', '보통출력', '고출력']

# Use pd.cut to assign each value to one of the 3 bins
df['hp_bin'] = pd.cut(x=df['horsepower'],
                      bins=bin_dividers,
                      labels=bin_names,
                      include_lowest=True)  # include the first (lowest) edge in the first bin

print(df[['horsepower', 'hp_bin']].head(20))

count :  [257 102  32]
bin_dividers :
 [ 46.         107.33333333 168.66666667 230.        ]

    horsepower hp_bin
0        165.0   보통출력
1        150.0   보통출력
2        150.0   보통출력
3        140.0   보통출력
4        198.0    고출력
5        220.0    고출력
6        215.0    고출력
7        225.0    고출력
8        190.0    고출력
9        170.0    고출력
10       160.0   보통출력
11       150.0   보통출력
12       225.0    고출력
13        95.0    저출력
14        95.0    저출력
15        97.0    저출력
16        85.0    저출력
17        88.0    저출력
18        46.0    저출력
19        87.0    저출력
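By default `pd.cut` builds right-closed intervals (`right=True`), so the first bin excludes its left edge. Without `include_lowest=True`, the smallest value falls outside every bin and becomes `NaN`. The sketch below illustrates this with made-up values; the `hp` Series, the `edges` array, and the English labels are assumptions for illustration, not the data used above.

```python
import numpy as np
import pandas as pd

# Hypothetical values; 46.0 sits exactly on the lowest bin edge.
hp = pd.Series([46.0, 95.0, 150.0, 230.0])
edges = np.linspace(46.0, 230.0, num=4)   # 3 equal-width bins
labels = ['low', 'medium', 'high']

# Default behaviour: the first interval is open on the left, so 46.0 is not assigned.
without_lowest = pd.cut(hp, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left as well.
with_lowest = pd.cut(hp, bins=edges, labels=labels, include_lowest=True)

print(without_lowest.tolist())
print(with_lowest.tolist())
```

This is why row 18 in the output above (horsepower 46.0, the minimum) is still labelled 저출력 rather than missing.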

Dummy variables

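Categorical data cannot always be fed directly into machine learning algorithms such as regression analysis; in those cases it must first be converted into input values that the computer can process.

For this purpose we use dummy variables, which take the value 0 or 1. Here 0 and 1 only indicate whether a given feature is absent or present.

Because the categorical data is converted into one-hot vectors consisting only of 0s and 1s so that the computer can process it, this is called **one-hot encoding**.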

pd.get_dummies() : converting to dummy variables

The pandas get_dummies() function converts every unique value of a categorical variable into its own new dummy variable.
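As a minimal sketch of this conversion, the example below applies `pd.get_dummies()` to a small made-up Series. The `hp_bin` values and English labels here are hypothetical stand-ins for a binned column. Newer pandas versions return `True`/`False` by default, so `dtype=int` is passed to get 0 and 1.

```python
import pandas as pd

# Hypothetical bin labels standing in for a binned column.
hp_bin = pd.Series(['low', 'high', 'medium', 'low'], name='hp_bin')

# One new column per unique value; dtype=int yields 0/1 instead of booleans.
dummies = pd.get_dummies(hp_bin, dtype=int)
print(dummies)

# Each row contains exactly one 1, marking the category that row belongs to.
print(dummies.sum(axis=1).tolist())
```

Each unique value becomes a column, and every row has a single 1 in the column of its own category, which is the one-hot vector described above.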