The Most Undervalued Standard Python Library

Oct 26, 2019 17:08 · 588 words · 3 minute read

Python has a lot of great libraries included out of the box. One of which is collections. The collections module provides “high-performance container datatypes” which provide alternatives to the general-purpose containers dict, list, set, and tuple. I’d love to introduce you to three of these datatypes and in the end, you’ll be wondering how you ever lived without them.

NamedTuple

I can’t overstate how useful namedtuples can be for data scientists. Let me know if this scenario sounds familiar: You are doing feature engineering and since you love lists, you just keep appending the features to a list, which you then feed into your machine learning model. Soon, you might have hundreds of features and that’s when things get messy. You no longer remember which feature refers to which index in your list. Worse, when someone else looks at your code, they have no idea what is going on with this monstrous list of features.

Enter NamedTuples to save the day.

With just a few extra lines of code, your crazy messy list will be restored to order. Let’s take a look

If you were to run this code, it would print out “22”, the age you stored in your row. This is amazing! Now you don’t have to use indexes to access your features but instead can use human-understandable names. This makes your code significantly more maintainable and clean.

Counter

Counter is aptly named — its main function is counting. This sounds simple, but it turns out that data scientists often have to count things, so it can be very handy.

There are a few ways it can be initialized, but I most often have a list of values and feed that list in as so

If you were to run the above code (which you can by using this awesome tool), you would get the following output:

[(22, 5), (25, 3), (24, 2), (30, 2), (35, 1), (40, 1), (11, 1), (45, 1), (15, 1), (16, 1), (52, 1), (26, 1)]

A list of tuples ordered by the most common where the tuple first contains the value and then the count. So we can now quickly see that 22 is the most common age with 5 occurrences and that there is a long-tail of ages with only 1 count. Nice!

DefaultDict

This is one of my favorites. DefaultDict is a dictionary that is initialized with a default value when each key is encountered for the first time. Here is an example

This returns

defaultdict(<type ‘int’>, {‘a’: 4, ' ‘: 8, ‘c’: 1, ‘e’: 2, ’d’: 2, ‘f’: 2, ‘i’: 1, ‘h’: 1, ‘l’: 1, ‘o’: 2, ‘n’: 1, ’s': 3, ‘r’: 2, ‘u’: 1, ’t': 3, ‘x’: 1})

Normally, when you try and access a value not in a dictionary it throws an error. There are other ways to handle this, but they add unnecessary code when you have a default value you want to assume. In our example above, we initialize the defauldict with int. That means on first access, it will assume a zero, so we can easily just keep adding up the counts of all of the characters. Simple and clean. Another common initialization is list, which allows you to immediately start appending values upon first access.

Photo by Hitesh Choudhary on Unsplash Photo by Hitesh Choudhary on Unsplash

Go Write Cleaner Code

Now that you know about the collections library and some of its awesome features, go use them! You’ll be surprised by how often they are useful and how much nicer your code will be. Enjoy!