Python - Pandas, Resample Dataset To Have Balanced Classes
With the following data frame, which has only 2 possible labels:
  name  f1  f2  label
0    A   8   9      1
1    A   5   3      1
2    B   8   9      0
3    C   9   2      0
4
Solution 1:
A very simple approach. Taken from sklearn documentation and Kaggle.
import pandas as pd
from sklearn.utils import resample

df_majority = df[df.label==0]
df_minority = df[df.label==1]
# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,                # sample with replacement
                                 n_samples=len(df_majority),  # to match majority class
                                 random_state=42)             # reproducible results
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
# Display new class counts
df_upsampled.label.value_counts()
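To try this end to end, here is a minimal sketch on a toy frame. The data and the helper name `upsample_minority` are made up for illustration; it assumes a binary `label` column with classes of different sizes, and it sizes the minority sample from the majority count rather than from a fixed number.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data (illustrative only): four rows of class 0, two rows of class 1.
df = pd.DataFrame({
    "name": ["A", "A", "B", "C", "D", "E"],
    "f1": [8, 5, 8, 9, 1, 4],
    "f2": [9, 3, 9, 2, 7, 6],
    "label": [1, 1, 0, 0, 0, 0],
})


def upsample_minority(frame, label_col="label", random_state=42):
    """Hypothetical helper: upsample the rarer of two classes to match the other.

    Assumes exactly two classes of different sizes.
    """
    counts = frame[label_col].value_counts()
    majority_label, minority_label = counts.idxmax(), counts.idxmin()
    majority = frame[frame[label_col] == majority_label]
    minority = frame[frame[label_col] == minority_label]
    minority_upsampled = resample(
        minority,
        replace=True,               # sample with replacement
        n_samples=len(majority),    # match the majority class size
        random_state=random_state,  # reproducible results
    )
    return pd.concat([majority, minority_upsampled])


df_upsampled = upsample_minority(df)
print(df_upsampled["label"].value_counts())
```

Both classes should end up with the same number of rows. The minority rows are duplicates of the original ones, so no new information is created.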
Solution 2:
Provided that each name is labeled by exactly one label (e.g. all A are 1) you can use the following:
- Group the names by label and check which label has an excess (in terms of unique names).
- Randomly remove names from the over-represented label class in order to account for the excess.
- Select the part of the data frame which does not contain the removed names.
Here is the code:
import numpy as np

labels = df.groupby('label').name.unique()
# Sort the over-represented class to the head.
labels = labels[labels.apply(len).sort_values(ascending=False).index]
excess = len(labels.iloc[0]) - len(labels.iloc[1])
remove = np.random.choice(labels.iloc[0], excess, replace=False)
df2 = df[~df.name.isin(remove)]
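As a worked example, the same steps can be run on a small made-up frame. The data below is illustrative only, and the seeded `np.random.default_rng` generator is an assumption added for reproducibility; the original snippet uses the global `np.random.choice`.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): every name carries exactly one label.
# Label 0 has three unique names (B, C, D); label 1 has one (A).
df = pd.DataFrame({
    "name": ["A", "A", "B", "C", "C", "D"],
    "f1": [8, 5, 8, 9, 4, 1],
    "f2": [9, 3, 9, 2, 6, 7],
    "label": [1, 1, 0, 0, 0, 0],
})

rng = np.random.default_rng(0)  # seeded generator for reproducibility

# Unique names per label, with the over-represented label first.
names_per_label = df.groupby("label")["name"].unique()
order = names_per_label.apply(len).sort_values(ascending=False).index
names_per_label = names_per_label.loc[order]

# Drop as many randomly chosen names as the larger class has in excess.
excess = len(names_per_label.iloc[0]) - len(names_per_label.iloc[1])
remove = rng.choice(names_per_label.iloc[0], size=excess, replace=False)
df2 = df[~df["name"].isin(remove)]

# Number of unique names left in each class.
print(df2.groupby("label")["name"].nunique())
```

Note that this balances the number of unique names per label, not the number of rows: a name that appears several times keeps all of its rows.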
Solution 3:
Using imbalanced-learn (pip install imbalanced-learn), this is as simple as:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy='not minority', random_state=1)
df_balanced, balanced_labels = rus.fit_resample(df, df['label'])
There are many methods other than RandomUnderSampler, so I suggest you read the documentation.
Solution 4:
You can also downsample the majority class to the size of the minority class:
# Separate the majority and minority classes
df_minority = df[df['label']==1]
df_majority = df[df['label']==0]
# Now, downsample the majority class to the number of samples in the minority class
df_majority = df_majority.sample(len(df_minority), random_state=0)
# Concatenate the majority and minority dataframes
df = pd.concat([df_majority, df_minority])
# Shuffle the dataset to prevent the model from getting biased by similar samples
df = df.sample(frac=1, random_state=0)
Solution 5:
You can make use of a grouped representation for resampling.
import pandas as pd

def balance_df(frame: pd.DataFrame, col: str, upsample_minority: bool):
    grouped = frame.groupby(col)
    # Target size per class: the largest class when upsampling, the smallest otherwise.
    n_samp = {
        True: grouped.size().max(),
        False: grouped.size().min(),
    }[upsample_minority]
    # Sample with replacement only when upsampling.
    fun = lambda x: x.sample(n_samp, replace=upsample_minority)
    balanced = grouped.apply(fun)
    balanced = balanced.reset_index(drop=True)
    return balanced
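A possible variant of the same idea, sketched here under the assumption that pandas 1.1 or newer is available, uses `DataFrameGroupBy.sample` instead of `apply` with a lambda. The function name `balance_with_groupby_sample`, the `random_state` parameter, and the toy data are illustrative additions, not part of the answer above.

```python
import pandas as pd


def balance_with_groupby_sample(frame, col, upsample_minority, random_state=None):
    """Hypothetical variant of balance_df built on DataFrameGroupBy.sample."""
    sizes = frame.groupby(col).size()
    # Target size per class: largest class when upsampling, smallest otherwise.
    n_samp = int(sizes.max() if upsample_minority else sizes.min())
    balanced = frame.groupby(col).sample(
        n=n_samp,
        replace=upsample_minority,  # replacement is only needed when upsampling
        random_state=random_state,
    )
    return balanced.reset_index(drop=True)


# Toy data (illustrative only): four rows of class 0, two rows of class 1.
df = pd.DataFrame({
    "name": ["A", "A", "B", "C", "D", "E"],
    "f1": [8, 5, 8, 9, 1, 4],
    "f2": [9, 3, 9, 2, 7, 6],
    "label": [1, 1, 0, 0, 0, 0],
})

down = balance_with_groupby_sample(df, "label", upsample_minority=False, random_state=0)
up = balance_with_groupby_sample(df, "label", upsample_minority=True, random_state=0)
print(down["label"].value_counts())
print(up["label"].value_counts())
```

As with the `apply` version, upsampling draws every class with replacement, including the largest one, so some majority rows may be duplicated while others are dropped.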