Binning

  • Published: 2019-11-27
  • Difficulty: Easy
  • Category: Data preprocessing, noise
  • Tags: Python, noisy data, binning

1. Problem Description

Use the third-party Python libraries numpy and pandas to handle noisy data with the binning method.

2. Implementation

In [6]:
import pandas as pd
import numpy as np
income = np.array([800, 1000, 1200, 1500, 1500, 1800, 2000, 2300, 2500, 2800, 3000, 3500, 4000, 4500, 4800, 5000])
colBin = pd.cut(income,bins=4)
print(pd.value_counts(colBin,sort=False))
(795.8, 1850.0]     6
(1850.0, 2900.0]    4
(2900.0, 3950.0]    2
(3950.0, 5000.0]    4
dtype: int64
In [7]:
print(colBin)
[(795.8, 1850.0], (795.8, 1850.0], (795.8, 1850.0], (795.8, 1850.0], (795.8, 1850.0], ..., (2900.0, 3950.0], (3950.0, 5000.0], (3950.0, 5000.0], (3950.0, 5000.0], (3950.0, 5000.0]]
Length: 16
Categories (4, interval[float64]): [(795.8, 1850.0] < (1850.0, 2900.0] < (2900.0, 3950.0] < (3950.0, 5000.0]]
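
A note on where the edges above come from: with an integer `bins`, `pd.cut` splits the range of the data into equal-width intervals, and with the default `right=True` the first edge appears to be lowered by 0.1% of the range so that the minimum lands inside the first bin (800 - 0.001 × 4200 = 795.8, matching the output). The following is a minimal sketch that reproduces this with numpy under that assumption; the names `lo`, `hi`, `edges` and `same` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

income = np.array([800, 1000, 1200, 1500, 1500, 1800, 2000, 2300,
                   2500, 2800, 3000, 3500, 4000, 4500, 4800, 5000])

lo, hi = income.min(), income.max()

# Four equal-width bins need five edges evenly spaced over [min, max].
edges = np.linspace(lo, hi, 5)

# Assumption: lower the first edge by 0.1% of the range so that the
# minimum value falls inside the right-closed first interval.
edges[0] -= 0.001 * (hi - lo)
print(edges)

# Passing these edges explicitly should assign every value to the same
# bin as bins=4 does.
same = (pd.cut(income, bins=edges).codes == pd.cut(income, bins=4).codes).all()
print(same)
```

If `same` is true, the hand-built edges and `bins=4` agree on this data set, which is a quick way to check one's understanding of the automatic edges.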
In [8]:
def binning(col, cut_points, labels=None):
    minval = col.min()
    maxval = col.max()
    # Build the bin edges from the minimum, the cut points, and the maximum
    break_points = [minval] + cut_points + [maxval]
    # If no labels are given, use the default labels 0, 1, 2, ..., (n-1)
    if not labels:
        labels = range(len(cut_points) + 1)
    colBin = pd.cut(col,bins=break_points,labels=labels,right=True,include_lowest=True)
    return colBin
cut_points = [1000, 2500, 4500]
labels = ["low", "medium", "high", "very high"]
income_box = binning(income, cut_points, labels)
print(pd.value_counts(income_box, sort=False))
low          2
medium       7
high         5
very high    2
dtype: int64
In [9]:
print(income_box)
[low, low, medium, medium, medium, ..., high, high, high, very high, very high]
Length: 16
Categories (4, object): [low < medium < high < very high]
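
The cells above only assign each value to a bin. To actually smooth noise, a common follow-up step is to replace every value with a statistic of its bin, such as the bin mean. The source does not show this step, so the following is only a minimal sketch, assuming smoothing by bin means on the four equal-width bins is what is wanted; the names `codes` and `smoothed` are illustrative.

```python
import numpy as np
import pandas as pd

income = np.array([800, 1000, 1200, 1500, 1500, 1800, 2000, 2300,
                   2500, 2800, 3000, 3500, 4000, 4500, 4800, 5000])

# Integer bin index (0..3) of each value, using 4 equal-width bins.
codes = pd.cut(income, bins=4).codes

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = pd.Series(income).groupby(codes).transform("mean")
print(smoothed.tolist())
```

Replacing `"mean"` with `"median"` gives smoothing by bin medians, which is less sensitive to a single extreme value inside a bin.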