How To Create Bins For Two Columns Of Data

In this tutorial, you'll learn how to bin data in Python with the Pandas cut and qcut functions. You'll learn why binning is a useful skill in Pandas and how you can use it to better group and distill information. By the end of this tutorial, you'll have learned:

How to use the cut and qcut functions in Pandas
When to use which function
How to modify the behavior of these functions to customize the bins that are created

What is Binning in Pandas and Python?

In many cases when dealing with continuous numeric data (such as ages, sales, or incomes), it can be helpful to create bins of your data. Binning data will convert data into discrete buckets, allowing you to gain insight into your data in logical ways. Binning data is also often referred to under several other terms, such as discrete binning, quantization, and discretization.

In this tutorial, you'll learn about two different Pandas methods, .cut() and .qcut(), for binning your data. These methods will allow you to bin data into custom-sized bins and equal-sized bins, respectively. Equal-sized bins allow you to gain easy insight into the distribution, while grouping data into custom bins can allow you to gain insight into logical categorical groupings.
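To make the difference concrete, here is a minimal sketch on made-up values with one large outlier. The `values` Series and the variable names are illustrative assumptions, not part of the tutorial's dataset. It compares equal-width bins from pd.cut with equal-count bins from pd.qcut:

```python
import pandas as pd

# Illustrative, made-up values with one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# pd.cut with an integer: two bins of equal width across the value range
equal_width = pd.cut(values, 2)

# pd.qcut: two bins that each hold an equal share of the rows
equal_count = pd.qcut(values, 2)

# Count the rows per bin, keeping the bins in their natural order
width_counts = equal_width.value_counts(sort=False).tolist()
count_counts = equal_count.value_counts(sort=False).tolist()

print(width_counts)  # [7, 1]
print(count_counts)  # [4, 4]
```

The outlier stretches the equal-width bins so that almost every row lands in the first one, while the quantile-based bins stay balanced.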

Loading a Sample Pandas DataFrame

To follow along with the tutorial, let's use a very simple Pandas DataFrame. The data is deliberately kept simple to better understand how the data is being split. The dataset has only two columns: a Name column and an Age column. Let's load the data using the .from_dict() method:

    # Loading a Sample Pandas DataFrame
    import pandas as pd

    df = pd.DataFrame.from_dict({
        'Name': ['Ray', 'Jane', 'Kate', 'Nik', 'Autumn', 'Kasi', 'Mandeep', 'Evan', 'Kyra', 'Jim'],
        'Age': [12, 7, 33, 34, 45, 65, 77, 11, 32, 55]
    })

    print(df.head())

    # Returns:
    #      Name  Age
    # 0     Ray   12
    # 1    Jane    7
    # 2    Kate   33
    # 3     Nik   34
    # 4  Autumn   45

In the next section, you'll learn how to use the Pandas .qcut() method to bin data into equal-sized bins.

Pandas qcut: Binning Data into Equal-Sized Bins

The Pandas .qcut() method splits your data into equal-sized buckets, based on rank or some sample quantiles. This process is known as quantile-based discretization. Let's take a look at the parameters available in the function:

    # Parameters of the Pandas .qcut() method
    pd.qcut(
        x,                      # Column to bin
        q,                      # Number of quantiles
        labels=None,            # List of labels to include
        retbins=False,          # Whether to return the bins/labels or not
        precision=3,            # The precision to store and display the bin labels
        duplicates='raise'      # If bin edges are not unique, raise a ValueError
    )

The function only has two required parameters, the column to bin (x=) and the number of quantiles to generate (q=). The function returns a Series of data that can, for example, be assigned to a new column. Let's see how we can split our Age column into four different quantiles:

    # Splitting Age Column into 4 Quantiles
    df['Age Groups'] = pd.qcut(df['Age'], 4)
    print(df.head())

    # Returns:
    #      Name  Age     Age Groups
    # 0     Ray   12  (6.999, 17.0]
    # 1    Jane    7  (6.999, 17.0]
    # 2    Kate   33   (17.0, 33.5]
    # 3     Nik   34   (33.5, 52.5]
    # 4  Autumn   45   (33.5, 52.5]

At first glance, this new Age Groups column may look a little strange. Let's take a moment to explore it a bit. First, we'll have a look at the data type of the column, using the .dtype attribute.

    # Checking the data type of the qcut column
    df['Age Groups'] = pd.qcut(df['Age'], 4)
    print(df['Age Groups'].dtype)

    # Returns: category

The data type that gets returned is category, which is an incredibly memory-efficient way for Pandas to store categorical data. Let's take a look at what the actual labels in the column mean:

[Figure: What the brackets in Pandas binning mean]

A square bracket, [ or ], indicates that the edge value is included in the range. A regular parenthesis, ( or ), indicates that the edge is not included in the group. For example, the bin (17.0, 33.5] holds values greater than 17.0 and up to and including 33.5.
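The bracket rules can be checked directly with a pd.Interval object, which is the type Pandas uses for these bin labels. This is a small sketch; the variable names are illustrative:

```python
import pandas as pd

# An interval like the "(17.0, 33.5]" label: open on the left, closed on the right
interval = pd.Interval(left=17.0, right=33.5, closed="right")

# The left edge is excluded, the right edge is included
left_edge_included = 17.0 in interval
right_edge_included = 33.5 in interval

print(interval)             # (17.0, 33.5]
print(left_edge_included)   # False
print(right_edge_included)  # True
```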

Splitting Data Into Equal Percentiles Using Pandas qcut

Rather than simply passing in the number of groupings you want to create, you can also pass in a list of quantiles you want to create. This list should range from 0 through 1, splitting the data into equal percentages. Let's see how we can split our data into 25% bins.

    # Splitting Age Column into Four Quantiles
    df['Age Groups'] = pd.qcut(
        df['Age'],
        [0, 0.25, 0.5, 0.75, 1]
    )
    print(df.head())

    # Returns:
    #      Name  Age     Age Groups
    # 0     Ray   12  (6.999, 17.0]
    # 1    Jane    7  (6.999, 17.0]
    # 2    Kate   33   (17.0, 33.5]
    # 3     Nik   34   (33.5, 52.5]
    # 4  Autumn   45   (33.5, 52.5]

You can see here that this returned the same result as we had before. Our data is split into four equal-sized buckets based on the quantiles of the data.
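One way to check how evenly the rows were distributed is to count the rows per bin. The sketch below rebuilds the Age values from the sample DataFrame as a standalone Series (the names `ages`, `age_groups`, and `bucket_sizes` are illustrative). With ten rows and four quantiles, the buckets can only be approximately equal:

```python
import pandas as pd

# The Age values from the sample DataFrame
ages = pd.Series([12, 7, 33, 34, 45, 65, 77, 11, 32, 55], name="Age")

# Bin into quartiles, then count rows per bin in bin order
age_groups = pd.qcut(ages, [0, 0.25, 0.5, 0.75, 1])
bucket_sizes = age_groups.value_counts(sort=False)

print(bucket_sizes.tolist())  # [3, 2, 2, 3]
```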

Adding Labels to Bins in Pandas with qcut

Right now, the bins of our dataset are descriptive, but they're also a little hard to read. You can pass in a list of labels that you want to relabel your dataset as. The length of the list should match the number of bins being created. Let's see how we can convert our grouped data into descriptive labels:

    # Adding Labels to Pandas .qcut()
    df['Age Groups'] = pd.qcut(
        df['Age'],
        [0, 0.25, 0.5, 0.75, 1],
        labels=['0-25%', '26-50%', '51-75%', '76-100%']
    )
    print(df.head())

    # Returns:
    #      Name  Age Age Groups
    # 0     Ray   12      0-25%
    # 1    Jane    7      0-25%
    # 2    Kate   33     26-50%
    # 3     Nik   34     51-75%
    # 4  Autumn   45     51-75%

This makes our Pandas binning process much easier to understand!
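If you would rather have plain integer bin numbers than text labels, the labels= parameter also accepts False. The sketch below uses the same Age values as a standalone Series; the variable names are illustrative:

```python
import pandas as pd

# The Age values from the sample DataFrame
ages = pd.Series([12, 7, 33, 34, 45, 65, 77, 11, 32, 55], name="Age")

# labels=False returns the integer position of each row's bin (0 = lowest quartile)
quartile_codes = pd.qcut(ages, 4, labels=False)

print(quartile_codes.tolist())  # [0, 0, 1, 2, 2, 3, 3, 0, 1, 3]
```

Integer codes like these can be convenient as inputs to a groupby or a model, where interval labels would only get in the way.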

Modifying Bin Precision in Pandas with qcut

Let's go back to our earlier example, where we only passed in q=4 to split the data into four quantiles. The bins returned with a high degree of precision and looked like this: (6.999, 17.0]. By default, Pandas will use a precision=3 argument, which results in three decimal places being used to store and display the bins.

While this is more precise and accurate, it often doesn't look very nice. Let's try changing the precision to be 1 and see what our categories look like now:

    # Modifying Precision in Categories
    df['Age Groups'] = pd.qcut(
        df['Age'],
        4,
        precision=1
    )
    print(df.head())

    # Returns:
    #      Name  Age    Age Groups
    # 0     Ray   12   (6.9, 17.0]
    # 1    Jane    7   (6.9, 17.0]
    # 2    Kate   33  (17.0, 33.5]
    # 3     Nik   34  (33.5, 52.5]
    # 4  Autumn   45  (33.5, 52.5]

This is much easier to read and makes it easier to understand how the categories work, though you do lose some precision.
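If you want readable labels but still need the exact cut points, the retbins= parameter listed earlier can return them alongside the binned data. A minimal sketch, assuming the same Age values as a standalone Series (the names `groups` and `edges` are illustrative):

```python
import pandas as pd

# The Age values from the sample DataFrame
ages = pd.Series([12, 7, 33, 34, 45, 65, 77, 11, 32, 55], name="Age")

# retbins=True returns a tuple: the binned Series and an array of bin edges
groups, edges = pd.qcut(ages, 4, precision=1, retbins=True)

# Convert to plain floats for display
edge_list = [float(edge) for edge in edges]
print(edge_list)  # [7.0, 17.0, 33.5, 52.5, 77.0]
```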

In the following section, you'll learn how to use the Pandas cut function to define custom bins of data.

Pandas cut: Binning Data into Custom Bins

The Pandas cut function is closely related to the .qcut() function. However, it's used to bin values into discrete intervals, which you define yourself. This, for example, can be very helpful when defining meaningful age groups or income groups. In many cases, these groupings will have some other type of meaning, such as legal or cultural.

The Pandas .cut() function can, technically, accomplish the same results as the .qcut() function, but it also provides significantly more control over the results. Let's take a look at the function's parameters:

    # Parameters of the .cut() Function
    pd.cut(
        x,                          # The input array to be binned
        bins,                       # The bins to use: int (# of bins) or sequence (bin edges)
        right=True,                 # Whether to include the right-most edge
        labels=None,                # Labels to be used for bins
        retbins=False,              # Whether to return bins or not
        precision=3,                # Precision to store and display bins
        include_lowest=False,       # Whether first interval should be left inclusive or not
        duplicates='raise',         # What to do if bin edges are not unique
        ordered=True                # Whether labels are ordered or not
    )

You can see that there is a good amount of overlap between the parameters available in the .qcut() and .cut() functions. However, the cut function also provides significantly more options. For instance, as you'll learn soon, you can define how Pandas handles the edges of its bins.
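As the bins= comment above notes, the parameter also accepts a single integer. In that case, Pandas splits the range of the data into that many equal-width bins. A small sketch using the same Age values as a standalone Series (variable names are illustrative):

```python
import pandas as pd

# The Age values from the sample DataFrame
ages = pd.Series([12, 7, 33, 34, 45, 65, 77, 11, 32, 55], name="Age")

# An integer asks for that many equal-width bins between the min and max age
equal_width_groups = pd.cut(ages, 3)

# Equal width does not mean equal counts
group_sizes = equal_width_groups.value_counts(sort=False).tolist()
print(group_sizes)  # [3, 4, 3]
```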

Let's see how we can split the Age column into three different groups: under 18, 18 to 64, and 65 and older.

    df['Age Group'] = pd.cut(
        df['Age'],
        [0, 17, 64, 100]
    )
    print(df.head())

    # Returns:
    #      Name  Age Age Group
    # 0     Ray   12   (0, 17]
    # 1    Jane    7   (0, 17]
    # 2    Kate   33  (17, 64]
    # 3     Nik   34  (17, 64]
    # 4  Autumn   45  (17, 64]

You can see that you've created three separate age groups here. As the brackets indicate, the bins cover values greater than 0 up to 17, greater than 17 up to 64, and greater than 64 up to 100. For integer ages, that means 1 to 17, 18 to 64, and 65 to 100. In the next section, you'll learn how to apply labels to these groupings.

Adding Labels to Bins in Pandas with cut

In this section, you'll learn how to use the labels= parameter to pass in a list of labels. Similar to the qcut function, the labels need to be of the same length as the number of groupings.

Let's pass in some string labels to make the groupings easier to read:

    # Adding labels to the groupings
    df['Age Group'] = pd.cut(
        df['Age'],
        [0, 17, 64, 100],
        labels=['0-18 years old', '18-65 years old', '65+ years old']
    )
    print(df.head())

    # Returns:
    #      Name  Age        Age Group
    # 0     Ray   12   0-18 years old
    # 1    Jane    7   0-18 years old
    # 2    Kate   33  18-65 years old
    # 3     Nik   34  18-65 years old
    # 4  Autumn   45  18-65 years old

You can see that these results are much easier to read and interpret!

Modifying Edge Behaviour in Pandas cut

By default, Pandas will include the right-most edge of a group. Previously, when you defined the bins of [0, 17, 64, 100], this defined the following bins:

>0 to 17
>17 to 64
>64 to 100

In our example, this is fine as we're dealing with integer values. However, imagine that our ages were defined as floating-point values and we had an age of 17.5. Since the first bin goes up to (and includes) 17, the value of 17.5 would be incorrectly included in our 18-64 age group.

We can use the right= parameter to modify this behavior. The argument defaults to True, which means that the right-most value of each bin is included. If we change this value to False, then each bin will include all values up to (but not including) its right edge.

Let's recreate the same bins, but with a right-exclusive range:

    # Using the right= argument to modify binning behavior
    df['Age Group'] = pd.cut(
        df['Age'],
        [0, 18, 65, 100],
        labels=['0-18 years old', '18-65 years old', '65+ years old'],
        right=False
    )
    print(df.head())

    # Returns:
    #      Name  Age        Age Group
    # 0     Ray   12   0-18 years old
    # 1    Jane    7   0-18 years old
    # 2    Kate   33  18-65 years old
    # 3     Nik   34  18-65 years old
    # 4  Autumn   45  18-65 years old
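To see the floating-point issue in action, the sketch below bins a few made-up fractional ages both ways. The values in `float_ages` and the short labels are assumptions for illustration only:

```python
import pandas as pd

# Made-up fractional ages that sit near the bin edges
float_ages = pd.Series([17.0, 17.5, 18.0, 64.5, 65.0])
labels = ['Under 18', '18-64', '65+']

# Default right=True: bins are (0, 17], (17, 64], (64, 100]
right_inclusive = list(pd.cut(float_ages, [0, 17, 64, 100], labels=labels))

# right=False: bins are [0, 18), [18, 65), [65, 100)
right_exclusive = list(
    pd.cut(float_ages, [0, 18, 65, 100], labels=labels, right=False)
)

print(right_inclusive)  # ['Under 18', '18-64', '18-64', '65+', '65+']
print(right_exclusive)  # ['Under 18', 'Under 18', '18-64', '18-64', '65+']
```

With the right-inclusive bins, 17.5 and 64.5 each land one group too high. The right-exclusive bins place them where you would expect.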

Modifying First Interval Behaviour with Pandas cut

By default, Pandas will not include the left-most value in the first bin. With the default right-inclusive bins, if we'd included an age of 0, the value would not have been binned. If we wanted this value to be included, we could use the include_lowest= argument to alter the behavior.

By default, the argument will use a value of False. Changing this to True will include that left-most value. Let's see how to do this:

    # Including left-most values
    df['Age Group'] = pd.cut(
        df['Age'],
        [0, 18, 65, 100],
        labels=['0-18 years old', '18-65 years old', '65+ years old'],
        include_lowest=True
    )
    print(df.head())

    # Returns:
    #      Name  Age        Age Group
    # 0     Ray   12   0-18 years old
    # 1    Jane    7   0-18 years old
    # 2    Kate   33  18-65 years old
    # 3     Nik   34  18-65 years old
    # 4  Autumn   45  18-65 years old
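Since the sample data has no age of 0, the output above looks unchanged. A short sketch with a made-up Series that does contain 0 shows the difference. The values and variable names are illustrative assumptions:

```python
import pandas as pd

# Made-up ages, including a value that sits exactly on the lowest bin edge
ages_with_zero = pd.Series([0, 5, 18])
bins = [0, 18, 65, 100]

# Default: the first bin is (0, 18], so 0 falls outside every bin and becomes NaN
default_groups = pd.cut(ages_with_zero, bins)

# include_lowest=True: the first bin also captures the lowest edge
inclusive_groups = pd.cut(ages_with_zero, bins, include_lowest=True)

default_missing = default_groups.isna().tolist()
inclusive_missing = inclusive_groups.isna().tolist()

print(default_missing)    # [True, False, False]
print(inclusive_missing)  # [False, False, False]
```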

Creating Ordered Categories with Pandas cut

Starting in Pandas version 1.1.0, the Pandas cut function accepts an ordered= parameter. With its default value of True, the function returns an ordered categorical, which assigns an order to the values of that category. Let's see what this looks like when the default behavior is used.

    # Creating Ordered Categories
    print(pd.cut(
        df['Age'],
        [0, 18, 65, 100],
        labels=['0-18 years old', '18-65 years old', '65+ years old'],
        ordered=True
    ))

    # Returns:
    # 0     0-18 years old
    # 1     0-18 years old
    # 2    18-65 years old
    # 3    18-65 years old
    # 4    18-65 years old
    # 5    18-65 years old
    # 6      65+ years old
    # 7     0-18 years old
    # 8    18-65 years old
    # 9    18-65 years old
    # Name: Age, dtype: category
    # Categories (3, object): ['0-18 years old' < '18-65 years old' < '65+ years old']

This allows you to sort categorical values, which are often represented by strings. This is a great benefit over using string values, since you're able to sort values in a meaningful way.

Changing the behavior to ordered=False removes this hierarchy, if it's something that you don't want to be created.
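The sketch below illustrates what the ordering buys you. It uses hypothetical labels ('Child', 'Adult', 'Senior') chosen so that their alphabetical order differs from their logical order:

```python
import pandas as pd

# The Age values from the sample DataFrame
ages = pd.Series([12, 7, 33, 34, 45, 65, 77, 11, 32, 55], name="Age")

# Hypothetical labels whose alphabetical order differs from their logical order
labels = ['Child', 'Adult', 'Senior']
groups = pd.cut(ages, [0, 18, 65, 100], labels=labels)

# Sorting follows the category order, not the alphabet
category_order = list(groups.sort_values().unique())
alphabetical_order = sorted(labels)

# Ordered categories also support comparisons and min/max
adult_or_older = int((groups >= 'Adult').sum())
oldest_group = groups.max()

print(category_order)      # ['Child', 'Adult', 'Senior']
print(alphabetical_order)  # ['Adult', 'Child', 'Senior']
print(adult_or_older)      # 7
print(oldest_group)        # Senior
```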

Exercises

It's time to test your learning! Try to solve the exercises below, and use the solutions to double-check your work.

Since the .qcut() function doesn't let you specify including the lowest value of the range, the cut() function needs to be used.

    df['Age Group'] = pd.cut(
        df['Age'],
        [0, 0.25, 0.5, 0.75, 1],
        include_lowest=True,
        right=False
    )

Because categories, though they look like strings, aren't strings, their sorting might not work correctly. By including order in your categories, these values can be sorted appropriately.

The cut function allows you to define your own numeric ranges, while the qcut function enforces an equal distribution of the items in the bins.

Conclusion and Recap

In this tutorial, you learned how to bin your data in Python and Pandas using the cut and qcut functions. The section below provides a recap of what you learned:

The Pandas qcut function bins data into an equal distribution of items
The Pandas cut function allows you to define your own ranges of data
Binning your data allows you both to get a better understanding of the distribution of your data and to create logical categories based on other abstractions
Both functions give you flexibility in defining and displaying your bins

Additional Resources

To learn about related topics, check out the tutorials below:

Python Defaultdict: Overview and Examples
Pandas GroupBy: Group, Summarize, and Aggregate Data in Python
Pandas Describe: Descriptive Statistics on Your Dataframe
Pandas cut Official Documentation