Python - Pandas, Resample Dataset To Have Balanced Classes

April 06, 2024

With the following data frame, with only 2 possible labels:

  name  f1  f2  label
0    A   8   9      1
1    A   5   3      1
2    B   8   9      0
3    C   9   2      0
4

Solution 1:

A very simple approach. Taken from sklearn documentation and Kaggle.

import pandas as pd
from sklearn.utils import resample

df_majority = df[df.label==0]
df_minority = df[df.label==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,                # sample with replacement
                                 n_samples=len(df_majority),  # to match majority class
                                 random_state=42)             # reproducible results

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts
df_upsampled.label.value_counts()

Solution 2:

Provided that each name is labeled by exactly one label (e.g. all A are 1) you can use the following:

Group the names by label and check which label has an excess (in terms of unique names).
Randomly remove names from the over-represented label class in order to account for the excess.
Select the part of the data frame which does not contain the removed names.

Here is the code:

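import numpy as np

# Unique names per label
labels = df.groupby('label').name.unique()
# Sort the over-represented class to the head.
labels = labels[labels.apply(len).sort_values(ascending=False).index]
# Number of surplus names in the over-represented class
excess = len(labels.iloc[0]) - len(labels.iloc[1])
# Pick that many names at random and drop all of their rows
remove = np.random.choice(labels.iloc[0], excess, replace=False)
df2 = df[~df.name.isin(remove)]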
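
The approach above only works if every name carries exactly one label, so it can be worth checking that first. Here is a minimal sketch of such a check, assuming the column names `name` and `label` from the question. The helper `names_have_single_label` and the toy frame are made up for illustration.

```python
import pandas as pd


def names_have_single_label(frame, name_col="name", label_col="label"):
    """Return True if every name maps to exactly one distinct label."""
    return bool(frame.groupby(name_col)[label_col].nunique().eq(1).all())


# Toy frame built from the four complete rows shown in the question
toy = pd.DataFrame({
    "name":  ["A", "A", "B", "C"],
    "f1":    [8, 5, 8, 9],
    "f2":    [9, 3, 9, 2],
    "label": [1, 1, 0, 0],
})
print(names_have_single_label(toy))

# Hypothetical counter-example: name A now appears with both labels
conflicting = toy.assign(label=[1, 0, 0, 0])
print(names_have_single_label(conflicting))
```

If the check fails, removing whole names no longer balances the labels cleanly, and one of the row-level approaches is probably a better fit.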

Solution 3:

Using imbalanced-learn (pip install imbalanced-learn), this is as simple as:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy='not minority', random_state=1)
df_balanced, balanced_labels = rus.fit_resample(df, df['label'])

There are many methods other than RandomUnderSampler, so I suggest you read the documentation.

Solution 4:

You can also downsample the majority class to the size of the minority class:

import pandas as pd

# Separate the majority and minority classes
df_minority = df[df['label']==1]
df_majority = df[df['label']==0]

# Downsample the majority class to the number of samples in the minority class
df_majority = df_majority.sample(len(df_minority), random_state=0)

# Concatenate the majority and minority dataframes
df = pd.concat([df_majority, df_minority])

# Shuffle the dataset to prevent the model from getting biased by similar samples
df = df.sample(frac=1, random_state=0)

Solution 5:

You can make use of a grouped representation for resampling.

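import pandas as pd


def balance_df(frame: pd.DataFrame, col: str, upsample_minority: bool):
    # Upsampling draws every class up to the largest class size (with replacement);
    # downsampling draws every class down to the smallest class size.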
    grouped = frame.groupby(col)
    n_samp = {
        True: grouped.size().max(),
        False: grouped.size().min(),
    }[upsample_minority]

    fun = lambda x: x.sample(n_samp, replace=upsample_minority)
    balanced = grouped.apply(fun)
    balanced = balanced.reset_index(drop=True)
    return balanced
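
The same idea can also be written with `DataFrameGroupBy.sample` (available since pandas 1.1). This avoids `groupby.apply`, which emits a deprecation warning in pandas 2.2 and later when it operates on the grouping column. The following is only a sketch: the function name `balance_by_group_sample`, the `seed` parameter and the extra rows in the toy frame are assumptions added for illustration.

```python
import pandas as pd


def balance_by_group_sample(frame, col, upsample_minority, seed=0):
    """Draw the same number of rows from every class in `col`."""
    sizes = frame[col].value_counts()
    # Target size: largest class when upsampling, smallest when downsampling
    n_samp = int(sizes.max() if upsample_minority else sizes.min())
    return (
        frame.groupby(col)
             .sample(n=n_samp, replace=upsample_minority, random_state=seed)
             .reset_index(drop=True)
    )


# Toy frame: the first four rows come from the question, rows D and E are made up
toy = pd.DataFrame({
    "name":  ["A", "A", "B", "C", "D", "E"],
    "f1":    [8, 5, 8, 9, 1, 4],
    "f2":    [9, 3, 9, 2, 7, 6],
    "label": [1, 1, 0, 0, 0, 0],
})

down = balance_by_group_sample(toy, "label", upsample_minority=False)
up = balance_by_group_sample(toy, "label", upsample_minority=True)
print(down["label"].value_counts())
print(up["label"].value_counts())
```

As in `balance_df`, upsampling draws with replacement from every class, so the majority class can also end up with duplicated rows.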
