Real Python: Using the Python defaultdict Type for Handling Missing Keys

Category: IT Technology · Published: 4 years ago


A common problem that you can face when working with Python dictionaries is trying to access or modify keys that don’t exist in the dictionary. This raises a KeyError and interrupts your code’s execution. To handle these kinds of situations, the standard library provides the Python defaultdict type, a dictionary-like class that’s available in the collections module.

The Python defaultdict type behaves almost exactly like a regular Python dictionary, but if you try to access or modify a missing key, then defaultdict will automatically create the key and generate a default value for it. This makes defaultdict a valuable option for handling missing keys in dictionaries.

In this tutorial, you’ll learn:

  • How to use the Python defaultdict type for handling missing keys in a dictionary
  • When and why to use a Python defaultdict rather than a regular dict
  • How to use a defaultdict for grouping , counting , and accumulating operations

With this knowledge under your belt, you’ll be in a better position to use the Python defaultdict type effectively in your day-to-day programming challenges.

To get the most out of this tutorial, you should have some previous understanding of what Python dictionaries are and how to work with them.


Handling Missing Keys in Dictionaries

A common issue that you can face when working with Python dictionaries is how to handle missing keys . If your code is heavily based on dictionaries, or if you’re creating dictionaries on the fly all the time, then you’ll soon notice that dealing with frequent KeyError exceptions can be quite annoying and can add extra complexity to your code. With Python dictionaries, you have at least four available ways to handle missing keys:

  1. Use .setdefault()
  2. Use .get()
  3. Use the key in dict idiom
  4. Use a try and except block

The Python docs explain .setdefault() and .get() as follows:

setdefault(key[, default])

If key is in the dictionary, return its value. If not, insert key with a value of default and return default . default defaults to None .

get(key[, default])

Return the value for key if key is in the dictionary, else default . If default is not given, it defaults to None , so that this method never raises a KeyError .

( Source )

Here’s an example of how you can use .setdefault() to handle missing keys in a dictionary:

>>> a_dict = {}
>>> a_dict['missing_key']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    a_dict['missing_key']
KeyError: 'missing_key'
>>> a_dict.setdefault('missing_key', 'default value')
'default value'
>>> a_dict['missing_key']
'default value'
>>> a_dict.setdefault('missing_key', 'another default value')
'default value'
>>> a_dict
{'missing_key': 'default value'}

In the above code, you use .setdefault() to generate a default value for missing_key . Notice that your dictionary, a_dict , now has a new key called missing_key whose value is 'default value' . This key didn’t exist before you called .setdefault() . Finally, if you call .setdefault() on an existing key, then the call won’t have any effect on the dictionary. Your key will hold the original value instead of the new default value.

Note: In the above code example, you get an exception, and Python shows you a traceback message, which tells you that you’re trying to access a missing key in a_dict. If you want to dive deeper into how to decipher and understand a Python traceback, then check out Understanding the Python Traceback.

On the other hand, if you use .get() , then you can code something like this:

>>> a_dict = {}
>>> a_dict.get('missing_key', 'default value')
'default value'
>>> a_dict
{}

Here, you use .get() to generate a default value for missing_key , but this time, your dictionary stays empty. This is because .get() returns the default value, but this value isn’t added to the underlying dictionary. For example, if you have a dictionary called D , then you can assume that .get() works something like this:

D.get(key, default) -> D[key] if key in D, else default

With this pseudo-code, you can understand how .get() works internally. If the key exists, then .get() returns the value mapped to that key. Otherwise, the default value is returned. Your code never creates or assigns a value to key . In this example, default defaults to None .
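If you’d like to see that pseudo-code as runnable Python, here’s a minimal sketch. The helper name get_or_default is hypothetical and only mirrors the behavior described above. It isn’t how dict.get() is actually implemented:

```python
def get_or_default(mapping, key, default=None):
    """Hypothetical helper that mirrors the documented behavior of dict.get()."""
    if key in mapping:
        return mapping[key]
    # The key is missing: return the default without storing it.
    return default


a_dict = {'present': 1}
print(get_or_default(a_dict, 'present'))              # 1
print(get_or_default(a_dict, 'missing'))              # None
print(get_or_default(a_dict, 'missing', 'fallback'))  # fallback
print(a_dict)                                         # {'present': 1}
```

Just like with .get(), the dictionary is left untouched after the lookups.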

You can also use conditional statements to handle missing keys in dictionaries. Take a look at the following example, which uses the key in dict idiom:

>>> a_dict = {}
>>> if 'key' in a_dict:
...     # Do something with 'key'...
...     a_dict['key']
... else:
...     a_dict['key'] = 'default value'
...
>>> a_dict
{'key': 'default value'}

In this code, you use an if statement along with the in operator to check if key is present in a_dict. If so, then you can perform any action with key or with its value. Otherwise, you create the new key, key, and assign it a 'default value'. Note that the above code works similarly to .setdefault() but takes four lines of code, while .setdefault() would only take one line (in addition to being more readable).

You can also work around the KeyError by using a try and except block to handle the exception. Consider the following piece of code:

>>> a_dict = {}
>>> try:
...     # Do something with 'key'...
...     a_dict['key']
... except KeyError:
...     a_dict['key'] = 'default value'
...
>>> a_dict
{'key': 'default value'}

The try and except block in the above example catches the KeyError whenever you try to get access to a missing key. In the except clause, you create the key and assign it a 'default value' .

Note: If missing keys are uncommon in your code, then you might prefer to use a try and except block (EAFP coding style) to catch the KeyError exception. This is because the code doesn’t check the existence of every key and only handles a few exceptions, if any.

On the other hand, if missing keys are quite common in your code, then the conditional statement ( LBYL coding style ) can be a better choice because checking for keys can be less costly than handling frequent exceptions.

So far, you’ve learned how to handle missing keys using the tools that dict and Python offer you. However, the examples you saw here are quite verbose and hard to read. They might not be as straightforward as you might want. That’s why the Python standard library provides a more elegant, Pythonic, and efficient solution. That solution is collections.defaultdict, and that’s what you’ll be covering from now on.

Understanding the Python defaultdict Type

The Python standard library provides collections , which is a module that implements specialized container types. One of those is the Python defaultdict type, which is an alternative to dict that’s specifically designed to help you out with missing keys. defaultdict is a Python type that inherits from dict :

>>> from collections import defaultdict
>>> issubclass(defaultdict, dict)
True

The above code shows that the Python defaultdict type is a subclass of dict . This means that defaultdict inherits most of the behavior of dict . So, you can say that defaultdict is much like an ordinary dictionary.

The main difference between defaultdict and dict is that when you try to access or modify a key that’s not present in the dictionary, a default value is automatically given to that key . In order to provide this functionality, the Python defaultdict type does two things:

  1. It overrides .__missing__() .
  2. It adds .default_factory , a writable instance variable that you normally set by passing a callable at instantiation time.

The instance variable .default_factory will hold the first argument passed into defaultdict.__init__() . This argument can take a valid Python callable or None . If a callable is provided, then it’ll automatically be called by defaultdict whenever you try to access or modify the value associated with a missing key.

Note: All the remaining arguments to the class initializer are treated as if they were passed to the initializer of regular dict, including the keyword arguments.
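As a quick illustration of how the remaining arguments are forwarded, here’s a small sketch. The variable names and sample data are made up for this example:

```python
from collections import defaultdict

# The first argument becomes .default_factory; everything after it
# is handled the same way dict() would handle it.
from_mapping = defaultdict(int, {'a': 1}, b=2)
from_pairs = defaultdict(list, [('x', [1]), ('y', [2])])

print(from_mapping)       # defaultdict(<class 'int'>, {'a': 1, 'b': 2})
print(from_pairs['x'])    # [1]
print(from_mapping['c'])  # 0, because 'c' is missing and int() is called
```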

Take a look at how you can create and properly initialize a defaultdict :

>>> # Correct instantiation
>>> def_dict = defaultdict(list)  # Pass list to .default_factory
>>> def_dict['one'] = 1  # Add a key-value pair
>>> def_dict['missing']  # Access a missing key returns an empty list
[]
>>> def_dict['another_missing'].append(4)  # Modify a missing key
>>> def_dict
defaultdict(<class 'list'>, {'one': 1, 'missing': [], 'another_missing': [4]})

Here, you pass list to .default_factory when you create the dictionary. Then, you use def_dict just like a regular dictionary. Note that when you try to access or modify the value mapped to a non-existent key, the dictionary assigns it the default value that results from calling list() .

Keep in mind that you must pass a valid Python callable object to .default_factory , so remember not to call it using the parentheses at initialization time. This can be a common issue when you start using the Python defaultdict type. Take a look at the following code:

>>> # Wrong instantiation
>>> def_dict = defaultdict(list())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    def_dict = defaultdict(list())
TypeError: first argument must be callable or None

Here, you try to create a defaultdict by passing list() to .default_factory. The call to list() returns an empty list, which isn’t callable, so defaultdict raises a TypeError telling you that the first argument must be callable or None.

With this introduction to the Python defaultdict type, you can start coding with practical examples. The next few sections will walk you through some common use cases where you can rely on a defaultdict to provide an elegant, efficient, and Pythonic solution.

Using the Python defaultdict Type

Sometimes, you’ll use a mutable built-in collection (a list , dict , or set ) as values in your Python dictionaries. In these cases, you’ll need to initialize the keys before first use, or you’ll get a KeyError . You can either do this process manually or automate it using a Python defaultdict . In this section, you’ll learn how to use the Python defaultdict type for solving some common programming problems:

  • Grouping the items in a collection
  • Counting the items in a collection
  • Accumulating the values in a collection

You’ll be covering some examples that use list , set , int , and float to perform grouping, counting, and accumulating operations in a user-friendly and efficient way.

Grouping Items

A typical use of the Python defaultdict type is to set .default_factory to list and then build a dictionary that maps keys to lists of values. With this defaultdict , if you try to get access to any missing key, then the dictionary runs the following steps:

  1. Call list() to create a new empty list
  2. Insert the empty list into the dictionary using the missing key as key
  3. Return a reference to that list

This allows you to write code like this:

>>> from collections import defaultdict
>>> dd = defaultdict(list)
>>> dd['key'].append(1)
>>> dd
defaultdict(<class 'list'>, {'key': [1]})
>>> dd['key'].append(2)
>>> dd
defaultdict(<class 'list'>, {'key': [1, 2]})
>>> dd['key'].append(3)
>>> dd
defaultdict(<class 'list'>, {'key': [1, 2, 3]})

Here, you create a Python defaultdict called dd and pass list to .default_factory . Notice that even when key isn’t defined, you can append values to it without getting a KeyError . That’s because dd automatically calls .default_factory to generate a default value for the missing key .
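To make the three steps listed earlier more concrete, here’s a minimal sketch that performs them by hand on a regular dict. The helper get_or_create_list is hypothetical and only illustrates the idea. It’s not how defaultdict is implemented internally:

```python
def get_or_create_list(mapping, key):
    """Hypothetical helper that spells out the three steps by hand."""
    if key not in mapping:
        new_list = list()        # 1. Call list() to create a new empty list
        mapping[key] = new_list  # 2. Insert it using the missing key as key
    return mapping[key]          # 3. Return a reference to that list


plain = {}
get_or_create_list(plain, 'key').append(1)
get_or_create_list(plain, 'key').append(2)
print(plain)  # {'key': [1, 2]}
```

With a defaultdict, you get this behavior from plain subscription, so you don’t need a helper like this one.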

You can use defaultdict along with list to group the items in a sequence or a collection. Suppose that you’ve retrieved the following data from your company’s database:

Department    Employee Name
Sales         John Doe
Sales         Martin Smith
Accounting    Jane Doe
Marketing     Elizabeth Smith
Marketing     Adam Doe

With this data, you create an initial list of tuple objects like the following:

dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Accounting', 'Jane Doe'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Adam Doe')]

Now, you need to create a dictionary that groups the employees by department. To do this, you can use a defaultdict as follows:

from collections import defaultdict

dep_dd = defaultdict(list)
for department, employee in dep:
    dep_dd[department].append(employee)

Here, you create a defaultdict called dep_dd and use a for loop to iterate through your dep list. The statement dep_dd[department].append(employee) creates the keys for the departments, initializes them to an empty list, and then appends the employees to each department. Once you run this code, your dep_dd will look something like this:

>>> dep_dd
defaultdict(<class 'list'>, {'Sales': ['John Doe', 'Martin Smith'],
                             'Accounting': ['Jane Doe'],
                             'Marketing': ['Elizabeth Smith', 'Adam Doe']})

In this example, you group the employees by their department using a defaultdict with .default_factory set to list . To do this with a regular dictionary, you can use dict.setdefault() as follows:

dep_d = dict()
for department, employee in dep:
    dep_d.setdefault(department, []).append(employee)

This code is straightforward, and you’ll find similar code quite often in your work as a Python coder. However, the defaultdict version is arguably more readable, and for large datasets, it can also be a lot faster and more efficient. So, if speed is a concern for you, then you should consider using a defaultdict instead of a standard dict.

Grouping Unique Items

Continue working with the data of departments and employees from the previous section. After some processing, you realize that a few employees have been duplicated in the database by mistake. You need to clean up the data and remove the duplicated employees from your dep_dd dictionary. To do this, you can use a set as the .default_factory and rewrite your code as follows:

dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Accounting', 'Jane Doe'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Adam Doe'),
       ('Marketing', 'Adam Doe'),
       ('Marketing', 'Adam Doe')]

dep_dd = defaultdict(set)
for department, employee in dep:
    dep_dd[department].add(employee)

In this example, you set .default_factory to set . Sets are collections of unique objects , which means that you can’t create a set with repeated items. This is a really interesting feature of sets, which guarantees that you won’t have repeated items in your final dictionary.
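To see the effect of the set factory, here’s a short, self-contained sketch with a reduced sample of made-up records. Since sets are unordered, the example sorts each group before printing so that the output is stable:

```python
from collections import defaultdict

dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Marketing', 'Adam Doe'),
       ('Marketing', 'Adam Doe')]  # Duplicated record

dep_dd = defaultdict(set)
for department, employee in dep:
    dep_dd[department].add(employee)

for department, employees in dep_dd.items():
    print(department, sorted(employees))
# Sales ['John Doe', 'Martin Smith']
# Marketing ['Adam Doe']
```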

Counting Items

If you set .default_factory to int , then your defaultdict will be useful for counting the items in a sequence or collection. When you call int() with no arguments, the function returns 0 , which is the typical value you’d use to initialize a counter.

To continue with the example of the company database, suppose you want to build a dictionary that counts the number of employees per department. In this case, you can code something like this:

>>> from collections import defaultdict
>>> dep = [('Sales', 'John Doe'),
...        ('Sales', 'Martin Smith'),
...        ('Accounting', 'Jane Doe'),
...        ('Marketing', 'Elizabeth Smith'),
...        ('Marketing', 'Adam Doe')]
>>> dd = defaultdict(int)
>>> for department, _ in dep:
...     dd[department] += 1
...
>>> dd
defaultdict(<class 'int'>, {'Sales': 2, 'Accounting': 1, 'Marketing': 2})

Here, you set .default_factory to int . When you call int() with no argument, the returned value is 0 . You can use this default value to start counting the employees that work in each department. For this code to work correctly, you need a clean dataset. There must be no repeated data. Otherwise, you’ll need to filter out the repeated employees.
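If your data might contain exact duplicate records, then one possible way to filter them out before counting is to pass the list through set() first. This is only a sketch with made-up data, and it assumes that a duplicate means an identical (department, employee) tuple:

```python
from collections import defaultdict

dep = [('Sales', 'John Doe'),
       ('Sales', 'John Doe'),  # Duplicated record
       ('Sales', 'Martin Smith'),
       ('Marketing', 'Adam Doe')]

counts = defaultdict(int)
for department, _ in set(dep):  # set() drops identical tuples
    counts[department] += 1

print(counts['Sales'])      # 2
print(counts['Marketing'])  # 1
```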

Another example of counting items is the mississippi example, where you count the number of times each letter in a word is repeated. Take a look at the following code:

>>> from collections import defaultdict
>>> s = 'mississippi'
>>> dd = defaultdict(int)
>>> for letter in s:
...     dd[letter] += 1
...
>>> dd
defaultdict(<class 'int'>, {'m': 1, 'i': 4, 's': 4, 'p': 2})

In the above code, you create a defaultdict with .default_factory set to int. This sets the default value for any given key to 0. Then, you use a for loop to traverse the string s and use an augmented assignment operation to add 1 to the counter in every iteration. The keys of dd will be the letters in mississippi.

Note: Python’s augmented assignment operators are a handy shortcut to common operations.

Take a look at the following examples:

  • var += 1 is equivalent to var = var + 1
  • var -= 1 is equivalent to var = var - 1
  • var *= 1 is equivalent to var = var * 1

This is just a sample of how the augmented assignment operators work. You can take a look at the official documentation to learn more about this feature.

As counting is a relatively common task in programming, the Python dictionary-like class collections.Counter is specially designed for counting items in a sequence. With Counter , you can write the mississippi example as follows:

>>> from collections import Counter
>>> counter = Counter('mississippi')
>>> counter
Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})

In this case, Counter does all the work for you! You only need to pass in a sequence, and the dictionary will count its items, storing them as keys and the counts as values. Note that this example works because Python strings are also a sequence type.
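As a sketch of how Counter could also handle the employees-per-department count, you can feed it a generator expression that yields only the department names. The name per_department and the 'Legal' key are made up for this example:

```python
from collections import Counter

dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Accounting', 'Jane Doe'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Adam Doe')]

per_department = Counter(department for department, _ in dep)

print(per_department['Sales'])     # 2
print(per_department['Legal'])     # 0
print('Legal' in per_department)   # False
```

Note that looking up a missing key in a Counter returns 0 without adding the key, whereas a defaultdict(int) would insert it.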

Accumulating Values

Sometimes you’ll need to calculate the total sum of the values in a sequence or collection. Let’s say you have the following Excel sheet with data about the sales of your Python website:

Products     July       August     September
Books        1250.00    1300.00    1420.00
Tutorials    560.00     630.00     750.00
Courses      2500.00    2430.00    2750.00

Next, you process the data using Python and get the following list of tuple objects:

incomes = [('Books', 1250.00),
           ('Books', 1300.00),
           ('Books', 1420.00),
           ('Tutorials', 560.00),
           ('Tutorials', 630.00),
           ('Tutorials', 750.00),
           ('Courses', 2500.00),
           ('Courses', 2430.00),
           ('Courses', 2750.00),]

With this data, you want to calculate the total income per product. To do that, you can use a Python defaultdict with float as .default_factory and then code something like this:

 1 from collections import defaultdict
 2 
 3 dd = defaultdict(float)
 4 for product, income in incomes:
 5     dd[product] += income
 6 
 7 for product, income in dd.items():
 8     print(f'Total income for {product}: ${income:,.2f}')

Here’s what this code does:

  • In line 1 , you import the Python defaultdict type.
  • In line 3 , you create a defaultdict object with .default_factory set to float .
  • In line 4 , you define a for loop to iterate through the items of incomes .
  • In line 5 , you use an augmented assignment operation ( += ) to accumulate the incomes per product in the dictionary.

The second loop iterates through the items of dd and prints the incomes to your screen.

Note: If you want to dive deeper into dictionary iteration, check out How to Iterate Through a Dictionary in Python.

If you put all this code into a file called incomes.py and run it from your command line, then you’ll get the following output:

$ python3 incomes.py
Total income for Books: $3,970.00
Total income for Tutorials: $1,940.00
Total income for Courses: $7,680.00

You now have a summary of incomes per product, so you can make decisions on which strategy to follow for increasing the total income of your site.
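For comparison, here’s a minimal sketch of the same accumulation with a regular dict, using .get() to supply the starting value of 0.0 for products that haven’t been seen yet. The name totals is chosen just for this example:

```python
incomes = [('Books', 1250.00),
           ('Books', 1300.00),
           ('Books', 1420.00),
           ('Tutorials', 560.00),
           ('Tutorials', 630.00),
           ('Tutorials', 750.00),
           ('Courses', 2500.00),
           ('Courses', 2430.00),
           ('Courses', 2750.00)]

totals = {}
for product, income in incomes:
    # .get() returns 0.0 the first time a product shows up.
    totals[product] = totals.get(product, 0.0) + income

print(totals)
# {'Books': 3970.0, 'Tutorials': 1940.0, 'Courses': 7680.0}
```

Both versions produce the same totals. The defaultdict version just moves the starting value into the container itself.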

Diving Deeper Into defaultdict

So far, you’ve learned how to use the Python defaultdict type by coding some practical examples. At this point, you can dive deeper into type implementation and other working details. That’s what you’ll be covering in the next few sections.

defaultdict vs dict

For you to better understand the Python defaultdict type, a good exercise would be to compare it with its superclass, dict . If you want to know the methods and attributes that are specific to the Python defaultdict type, then you can run the following line of code:

>>> set(dir(defaultdict)) - set(dir(dict))
{'__copy__', 'default_factory', '__missing__'}

In the above code, you use dir() to get the list of valid attributes for dict and defaultdict. Then, you use a set difference to get the set of methods and attributes that you can only find in defaultdict. As you can see, the differences between these two classes come down to two methods and one instance attribute. The following table shows what the methods and the attribute are for:

Method or Attribute    Description
.__copy__()            Provides support for copy.copy()
.default_factory       Holds the callable invoked by .__missing__() to automatically provide default values for missing keys
.__missing__(key)      Gets called when .__getitem__() can’t find key

In the above table, you can see the methods and the attribute that make a defaultdict different from a regular dict . The rest of the methods are the same in both classes.

Note: If you initialize a defaultdict using a valid callable, then you won’t get a KeyError when you try to get access to a missing key. Any key that doesn’t exist gets the value returned by .default_factory.

Additionally, you might notice that a defaultdict is equal to a dict with the same items:

>>> std_dict = dict(numbers=[1, 2, 3], letters=['a', 'b', 'c'])
>>> std_dict
{'numbers': [1, 2, 3], 'letters': ['a', 'b', 'c']}
>>> def_dict = defaultdict(list, numbers=[1, 2, 3], letters=['a', 'b', 'c'])
>>> def_dict
defaultdict(<class 'list'>, {'numbers': [1, 2, 3], 'letters': ['a', 'b', 'c']})
>>> std_dict == def_dict
True

Here, you create a regular dictionary std_dict with some arbitrary items. Then, you create a defaultdict with the same items. If you test both dictionaries for content equality, then you’ll see that they’re equal.

defaultdict.default_factory

The first argument to the Python defaultdict type must be a callable that takes no arguments and returns a value. This argument is assigned to the instance attribute, .default_factory . For this, you can use any callable, including functions, methods, classes, type objects, or any other valid callable. The default value of .default_factory is None .
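Since any zero-argument callable will do, you’re not limited to built-in types. Here’s a small sketch that uses a regular function and a lambda as factories. The names unknown_status, status, and limits are hypothetical and chosen only for illustration:

```python
from collections import defaultdict


def unknown_status():
    """Hypothetical zero-argument factory function."""
    return 'unknown'


status = defaultdict(unknown_status)
limits = defaultdict(lambda: 10)  # A lambda that supplies a constant default

print(status['server-1'])  # unknown
print(limits['alice'])     # 10
limits['bob'] += 5
print(limits['bob'])       # 15
```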

If you instantiate defaultdict without passing a value to .default_factory , then the dictionary will behave like a regular dict and the usual KeyError will be raised for missing key lookup or modification attempts:

>>> from collections import defaultdict
>>> dd = defaultdict()
>>> dd['missing_key']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    dd['missing_key']
KeyError: 'missing_key'

Here, you instantiate the Python defaultdict type with no arguments. In this case, the instance behaves like a standard dictionary. So, if you try to access or modify a missing key, then you’ll get the usual KeyError . From this point on, you can use dd as a normal Python dictionary and, unless you assign a new callable to .default_factory , you won’t be able to use the ability of defaultdict to handle missing keys automatically.

If you pass None to the first argument of defaultdict , then the instance will behave the same way you saw in the above example. That’s because .default_factory defaults to None , so both initializations are equivalent. On the other hand, if you pass a valid callable object to .default_factory , then you can use it to handle missing keys in a user-friendly way. Here’s an example where you pass list to .default_factory :

>>> dd = defaultdict(list, letters=['a', 'b', 'c'])
>>> dd.default_factory
<class 'list'>
>>> dd
defaultdict(<class 'list'>, {'letters': ['a', 'b', 'c']})
>>> dd['numbers']
[]
>>> dd
defaultdict(<class 'list'>, {'letters': ['a', 'b', 'c'], 'numbers': []})
>>> dd['numbers'].append(1)
>>> dd
defaultdict(<class 'list'>, {'letters': ['a', 'b', 'c'], 'numbers': [1]})
>>> dd['numbers'] += [2, 3]
>>> dd
defaultdict(<class 'list'>, {'letters': ['a', 'b', 'c'], 'numbers': [1, 2, 3]})

In this example, you create a Python defaultdict called dd and pass list as its first argument. The second argument is the keyword argument letters, which holds a list of letters. You can see that .default_factory now holds the list class, which will be called whenever you need to supply a default value for a missing key.

Notice that when you try to access numbers, dd checks whether numbers is in the dictionary. If it’s not, then it calls .default_factory(). Since .default_factory holds the list class, the returned value is an empty list ([]).

Now that dd['numbers'] is initialized with an empty list , you can use .append() to add elements to the list . You can also use an augmented assignment operator ( += ) to concatenate the lists [1] and [2, 3] . This way, you can handle missing keys in a more Pythonic and more efficient way.

On the other hand, if you pass a non-callable object to the initializer of the Python defaultdict type, then you’ll get a TypeError like in the following code:

>>> defaultdict(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    defaultdict(0)
TypeError: first argument must be callable or None

Here, you pass 0 to .default_factory. Since 0 is not a callable object, you get a TypeError telling you that the first argument must be callable or None.

Keep in mind that .default_factory is only called from .__missing__(), which in turn is only called by .__getitem__() and not by other methods. This means that if dd is a defaultdict and key is a missing key, then dd[key] will call .default_factory to provide a default value, but dd.get(key) still returns None instead of the value that .default_factory would provide. That’s because .get() doesn’t call .__getitem__() to retrieve the key.

Take a look at the following code:

>>> dd = defaultdict(list)
>>> # Calls dd.__getitem__('missing')
>>> dd['missing']
[]
>>> # Doesn't call dd.__getitem__('another_missing')
>>> print(dd.get('another_missing'))
None
>>> dd
defaultdict(<class 'list'>, {'missing': []})

In this code fragment, you can see that dd.get() returns None rather than the default value that .default_factory would provide. That’s because .default_factory is only called from .__missing__() , which is not called by .get() .
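The same reasoning applies to membership tests with the in operator, which don’t go through .__getitem__() either. Here’s a brief sketch that checks the size of the dictionary after each operation to show which ones create a key:

```python
from collections import defaultdict

dd = defaultdict(list)

print('a' in dd)           # False
print(dd.get('a', 'n/a'))  # n/a
print(len(dd))             # 0, since neither operation created a key

dd['a']                    # Subscription triggers .default_factory
print(len(dd))             # 1
```

So, if you only want to check for a key without creating it, then you can use in or .get() instead of square brackets.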

Notice that you can also add arbitrary values to a Python defaultdict . This means that you’re not limited to values with the same type as the values generated by .default_factory . Here’s an example:

>>> dd = defaultdict(list)
>>> dd
defaultdict(<class 'list'>, {})
>>> dd['string'] = 'some string'
>>> dd
defaultdict(<class 'list'>, {'string': 'some string'})
>>> dd['list']
[]
>>> dd
defaultdict(<class 'list'>, {'string': 'some string', 'list': []})

Here, you create a defaultdict and pass the list class to .default_factory. This sets your default values to be empty lists. However, you can freely add a new key that holds values of a different type. That’s the case with the key string, which holds a str object instead of a list object.

Finally, you can always change or update the callable you initially assign to .default_factory in the same way you would do with any instance attribute:

>>> dd.default_factory = str
>>> dd['missing_key']
''

In the above code, you change .default_factory from list to str . Now, whenever you try to get access to a missing key, your default value will be an empty string ( '' ).

Depending on your use cases for the Python defaultdict type, you might need to freeze the set of keys once you finish creating the dictionary, so that lookups of missing keys no longer add new entries. To do this, you can set .default_factory to None after you finish populating the dictionary. This way, your dictionary will behave like a standard dict: missing keys raise a KeyError instead of getting automatically generated default values. Note that this doesn’t make the dictionary read-only, because you can still assign keys explicitly.
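Here’s a minimal sketch of that technique. The name groups and the sample data are made up for this example:

```python
from collections import defaultdict

groups = defaultdict(list)
for key, value in [('a', 1), ('b', 2), ('a', 3)]:
    groups[key].append(value)

groups.default_factory = None  # Stop generating default values from here on

try:
    groups['typo']
except KeyError as error:
    print(f'KeyError: {error}')  # KeyError: 'typo'

print(dict(groups))  # {'a': [1, 3], 'b': [2]}
```

After the factory is set to None, a mistyped key raises an error instead of silently creating an empty list.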

defaultdict vs dict.setdefault()

As you saw before, dict provides .setdefault() , which will allow you to assign values to missing keys on the fly. In contrast, with a defaultdict you can specify the default value up front when you initialize the container. You can use .setdefault() to assign default values as follows:

>>> d = dict()
>>> d.setdefault('missing_key', [])
[]
>>> d
{'missing_key': []}

In this code, you create a regular dictionary and then use .setdefault() to assign a value ( [] ) to the key missing_key , which wasn’t defined yet.

Note: You can assign any type of Python object using .setdefault(). This is an important difference compared to defaultdict if you consider that defaultdict only accepts a callable or None.

On the other hand, if you use a defaultdict to accomplish the same task, then the default value is generated on demand whenever you try to access or modify a missing key. Notice that, with defaultdict , the default value is generated by the callable you pass upfront to the initializer of the class. Here’s how it works:

>>> from collections import defaultdict
>>> dd = defaultdict(list)
>>> dd['missing_key']
[]
>>> dd
defaultdict(<class 'list'>, {'missing_key': []})

Here, you first import the Python defaultdict type from collections . Then, you create a defaultdict and pass list to .default_factory . When you try to get access to a missing key, defaultdict internally calls .default_factory() , which holds a reference to list , and assigns the resulting value (an empty list ) to missing_key .

The code in the above two examples does the same work, but the defaultdict version is arguably more readable, user-friendly, Pythonic, and straightforward.

Note: A call to a built-in type like list, set, dict, str, int, or float will return an empty object or zero for numeric types.

Take a look at the following code examples:

>>> list()
[]
>>> set()
set()
>>> dict()
{}
>>> str()
''
>>> float()
0.0
>>> int()
0

In this code, you call some built-in types with no arguments and get an empty object or zero for the numeric types.

Finally, using a defaultdict to handle missing keys can be faster than using dict.setdefault(). Take a look at the following example:

# Filename: exec_time.py

from collections import defaultdict
from timeit import timeit

animals = [('cat', 1), ('rabbit', 2), ('cat', 3), ('dog', 4), ('dog', 1)]
std_dict = dict()
def_dict = defaultdict(list)

def group_with_dict():
    for animal, count in animals:
        std_dict.setdefault(animal, []).append(count)
    return std_dict

def group_with_defaultdict():
    for animal, count in animals:
        def_dict[animal].append(count)
    return def_dict

print(f'dict.setdefault() takes {timeit(group_with_dict)} seconds.')
print(f'defaultdict takes {timeit(group_with_defaultdict)} seconds.')

If you run the script from your system’s command line, then you’ll get something like this:

$ python3 exec_time.py
dict.setdefault() takes 1.0281260240008123 seconds.
defaultdict takes 0.6704721650003194 seconds.

Here, you use timeit.timeit() to measure the execution time of group_with_dict() and group_with_defaultdict() . These functions perform equivalent actions, but the first uses dict.setdefault() , and the second uses a defaultdict . The time measure will depend on your current hardware, but you can see here that defaultdict is faster than dict.setdefault() . This difference can become more important as the dataset gets larger.
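One factor that may contribute to this difference is that the default argument of .setdefault() is built on every call, even when the key already exists, while a defaultdict only calls its factory for missing keys. The following sketch makes this visible by counting factory calls. The names calls and new_list_for are hypothetical helpers written for this illustration, and the sketch doesn’t claim to account for the whole timing gap:

```python
from collections import defaultdict

calls = {'setdefault': 0, 'defaultdict': 0}


def new_list_for(label):
    """Return a factory that counts how often it builds a new list."""
    def factory():
        calls[label] += 1
        return []
    return factory


animals = [('cat', 1), ('rabbit', 2), ('cat', 3), ('dog', 4), ('dog', 1)]

make_list = new_list_for('setdefault')
std_dict = {}
for animal, count in animals:
    # The default argument is evaluated on every iteration.
    std_dict.setdefault(animal, make_list()).append(count)

def_dict = defaultdict(new_list_for('defaultdict'))
for animal, count in animals:
    # The factory only runs when the key is missing.
    def_dict[animal].append(count)

print(calls)                 # {'setdefault': 5, 'defaultdict': 3}
print(std_dict == def_dict)  # True
```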

Additionally, you need to consider that creating a regular dict can be faster than creating a defaultdict . Take a look at this code:

>>>
>>> from timeit import timeit
>>> from collections import defaultdict
>>> print(f'dict() takes {timeit(dict)} seconds.')
dict() takes 0.08921320698573254 seconds.
>>> print(f'defaultdict() takes {timeit(defaultdict)} seconds.')
defaultdict() takes 0.14101867799763568 seconds.

This time, you use timeit.timeit() to measure the execution time of dict and defaultdict instantiation. Notice that, in this run, creating a dict takes roughly two thirds of the time of creating a defaultdict . This might not be a problem if you consider that, in real-world code, you normally instantiate defaultdict only once.

Also notice that, by default, timeit.timeit() will run your code a million times. That’s the reason for defining std_dict and def_dict outside the scope of group_with_dict() and group_with_defaultdict() in exec_time.py . Otherwise, the time measurement would be affected by the instantiation time of dict and defaultdict .
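
If a million runs is more than you need, timeit.timeit() also accepts a number argument. The following sketch is a variation on exec_time.py rather than the original script: group_fresh() is a hypothetical helper that deliberately creates a new defaultdict on every call, so the instantiation cost is part of the measurement, and the repeat count of 10,000 is an arbitrary choice for illustration.

```python
from collections import defaultdict
from timeit import timeit

animals = [('cat', 1), ('rabbit', 2), ('cat', 3), ('dog', 4), ('dog', 1)]

def group_fresh():
    # A new defaultdict per call, so every run starts from an empty mapping.
    grouped = defaultdict(list)
    for animal, count in animals:
        grouped[animal].append(count)
    return grouped

# number controls how many times timeit calls the function
# (it defaults to one million).
elapsed = timeit(group_fresh, number=10_000)
print(f'10,000 runs take {elapsed} seconds.')
```

The absolute numbers will vary between machines, so compare timings only when they come from the same environment.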

At this point, you may have an idea of when to use a defaultdict rather than a regular dict . Here are three things to take into account:

  1. If your code is heavily based on dictionaries and you’re dealing with missing keys all the time, then you should consider using a defaultdict rather than a regular dict .

  2. If your dictionary items need to be initialized with a constant default value, then you should consider using a defaultdict instead of a dict .

  3. If your code relies on dictionaries for aggregating, accumulating, counting, or grouping values, and performance is a concern, then you should consider using a defaultdict .

You can consider the above guidelines when deciding whether to use a dict or a defaultdict .
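
As a quick illustration of the third guideline, here’s a minimal sketch of counting and grouping with a defaultdict . The words list and the variable names are made up for this example:

```python
from collections import defaultdict

words = ['apple', 'avocado', 'banana', 'blueberry', 'apple']

# Counting: int() returns 0, so every new key starts at zero.
counts = defaultdict(int)
for word in words:
    counts[word] += 1

# Grouping: list() returns [], so every new key starts with an empty list.
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

print(dict(counts))     # {'apple': 2, 'avocado': 1, 'banana': 1, 'blueberry': 1}
print(dict(by_letter))  # {'a': ['apple', 'avocado', 'apple'], 'b': ['banana', 'blueberry']}
```

In both loops, the first access to a new key creates it, so you don’t need an explicit membership check or a call to .setdefault() .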

defaultdict.__missing__()

Behind the scenes, the Python defaultdict type works by calling .default_factory to supply default values for missing keys. The mechanism that makes this possible is .__missing__() , a special method that dict.__getitem__() calls on dict subclasses that define it. A plain dict doesn’t define .__missing__() itself, but defaultdict does.

Note: .__missing__() is automatically called by .__getitem__() to handle missing keys, and .__getitem__() is in turn called automatically by Python for subscription operations like d[key] .

So, how does .__missing__() work? If you set .default_factory to None , then .__missing__() raises a KeyError with the key as an argument. Otherwise, .default_factory is called without arguments to provide a default value for the given key . This value is inserted into the dictionary and finally returned. If calling .default_factory raises an exception, then the exception is propagated unchanged.
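
You can observe both outcomes with a real defaultdict . In the sketch below, broken_factory() is a hypothetical callable that always fails, used only to show that its exception reaches the caller:

```python
from collections import defaultdict

# With default_factory set to None, a missing key raises KeyError.
plain = defaultdict(None)
try:
    plain['missing']
except KeyError as error:
    print(f'KeyError: {error}')  # KeyError: 'missing'

def broken_factory():
    # Hypothetical factory that always fails.
    raise ValueError('cannot build a default')

# An exception raised inside the factory reaches the caller unchanged.
fragile = defaultdict(broken_factory)
try:
    fragile['missing']
except ValueError as error:
    print(f'ValueError: {error}')  # ValueError: cannot build a default

# No key was inserted, because the factory never returned a value.
print('missing' in fragile)  # False
```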

The following code shows a viable Python implementation for .__missing__() :

 1 def __missing__(self, key):
 2     if self.default_factory is None:
 3         raise KeyError(key)
 4     if key not in self:
 5         self[key] = self.default_factory()
 6     return self[key]

Here’s what this code does:

  • In line 1 , you define the method and its signature.
  • In lines 2 and 3 , you test to see if .default_factory is None . If so, then you raise a KeyError with the key as an argument.
  • In lines 4 and 5 , you check if the key is not in the dictionary. If it’s not, then you call .default_factory and assign its return value to the key .
  • In line 6 , you return the value stored under the key .

Keep in mind that the presence of .__missing__() in a mapping has no effect on the behavior of other methods that look up keys, such as .get() or .__contains__() , which implements the in operator. That’s because .__missing__() is only called by .__getitem__() when the requested key is not found in the dictionary. Whatever .__missing__() returns or raises is then returned or raised by .__getitem__() .
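
A short session-style sketch makes the difference visible. Only the subscription d[key] triggers .__missing__() and creates the key:

```python
from collections import defaultdict

dd = defaultdict(list)

# Neither .get() nor the in operator goes through .__missing__().
print(dd.get('a'))  # None
print('a' in dd)    # False
print(len(dd))      # 0

# A subscription does, so the key is created with the default value.
print(dd['a'])      # []
print('a' in dd)    # True
print(len(dd))      # 1
```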

Now that you’ve covered an alternative Python implementation for .__missing__() , it would be a good exercise to try to emulate defaultdict with some Python code. That’s what you’ll be doing in the next section.

Emulating the Python defaultdict Type

In this section, you’ll be coding a Python class that will behave much like a defaultdict . To do that, you’ll subclass collections.UserDict and then add .__missing__() . Also, you need to add an instance attribute called .default_factory , which will hold the callable for generating default values on demand. Here’s a piece of code that emulates most of the behavior of the Python defaultdict type:

 1 import collections
 2 
 3 class my_defaultdict(collections.UserDict):
 4     def __init__(self, default_factory=None, *args, **kwargs):
 5         super().__init__(*args, **kwargs)
 6         if not callable(default_factory) and default_factory is not None:
 7             raise TypeError('first argument must be callable or None')
 8         self.default_factory = default_factory
 9 
10     def __missing__(self, key):
11         if self.default_factory is None:
12             raise KeyError(key)
13         if key not in self:
14             self[key] = self.default_factory()
15         return self[key]

Here’s how this code works:

  • In line 1, you import collections to get access to UserDict .

  • In line 3, you create a class that subclasses UserDict .

  • In line 4, you define the class initializer .__init__() . This method takes an argument called default_factory to hold the callable that you’ll use to generate the default values. Notice that default_factory defaults to None , just like in a defaultdict . You also need the *args and **kwargs for emulating the normal behavior of a regular dict .

  • In line 5, you call the superclass .__init__() . This means that you’re calling UserDict.__init__() and passing *args and **kwargs to it.

  • In line 6, you first check if default_factory is a valid callable object. In this case, you use callable(object) , which is a built-in function that returns True if object appears to be a callable and otherwise returns False . This check ensures that you can call .default_factory() if you need to generate a default value for any missing key . Then, you check if .default_factory is not None .

  • In line 7, you raise a TypeError , just like a real defaultdict does when default_factory is neither callable nor None .

  • In line 8, you initialize .default_factory .

  • In line 10, you define .__missing__() , which is implemented as you saw before. Recall that .__missing__() is automatically called by .__getitem__() when a given key is not in a dictionary.
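
The check in lines 6 and 7 mirrors what the real defaultdict does. As a small sketch, you can confirm this with the built-in class, here using 42 as an arbitrary non-callable argument:

```python
from collections import defaultdict

try:
    defaultdict(42)  # 42 is neither callable nor None
except TypeError as error:
    print(f'TypeError: {error}')

# Both of these are accepted.
no_factory = defaultdict(None)
with_factory = defaultdict(list)
print(no_factory.default_factory)    # None
print(with_factory.default_factory)  # <class 'list'>
```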

If you feel in the mood to read some C code, then you can take a look at the full code for the Python defaultdict type in the CPython source code.

Now that you’ve finished coding this class, you can test it by putting the code into a Python script called my_dd.py and importing it from an interactive session. Here’s an example:

>>>
>>> from my_dd import my_defaultdict
>>> dd_one = my_defaultdict(list)
>>> dd_one
{}
>>> dd_one['missing']
[]
>>> dd_one
{'missing': []}
>>> dd_one.default_factory = int
>>> dd_one['another_missing']
0
>>> dd_one
{'missing': [], 'another_missing': 0}
>>> dd_two = my_defaultdict(None)
>>> dd_two['missing']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    dd_two['missing']
  File "/home/user/my_dd.py", line 10, in __missing__
    raise KeyError(key)
KeyError: 'missing'

Here, you first import my_defaultdict from my_dd . Then, you create an instance of my_defaultdict and pass list to .default_factory . If you try to get access to a key with a subscription operation, like dd_one['missing'] , then .__getitem__() is automatically called by Python. If the key is not in the dictionary, then .__missing__() is called, which generates a default value by calling .default_factory() .

You can also change the callable assigned to .default_factory using a normal assignment operation like in dd_one.default_factory = int . Finally, if you pass None to .default_factory , then you’ll get a KeyError when trying to retrieve a missing key.

Note: The behavior of a defaultdict is essentially the same as this Python equivalent. However, you’ll soon notice that your Python implementation doesn’t print as a real defaultdict but as a standard dict . You can modify this detail by overriding .__str__() and .__repr__() .
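
One possible way to do that is sketched below. The class name ReprDefaultDict and the exact output format are choices made for this example, not part of the original class. Because UserDict doesn’t define its own .__str__() , overriding .__repr__() is enough to change what print() shows as well:

```python
import collections

class ReprDefaultDict(collections.UserDict):
    # Hypothetical variant of my_defaultdict with a custom representation.
    def __init__(self, default_factory=None, *args, **kwargs):
        if default_factory is not None and not callable(default_factory):
            raise TypeError('first argument must be callable or None')
        super().__init__(*args, **kwargs)
        self.default_factory = default_factory

    def __missing__(self, key):
        if self.default_factory is None:
            raise KeyError(key)
        self[key] = self.default_factory()
        return self[key]

    def __repr__(self):
        # Show the class name, the factory, and the stored items.
        name = type(self).__name__
        return f'{name}({self.default_factory!r}, {self.data!r})'

dd = ReprDefaultDict(list)
dd['missing']
print(dd)  # ReprDefaultDict(<class 'list'>, {'missing': []})
```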

You may be wondering why you subclass collections.UserDict instead of a regular dict for this example. The main reason for this is that subclassing built-in types can be error-prone because the C code of the built-ins doesn’t seem to consistently call special methods overridden by the user.

Here’s an example that shows some issues that you can face when subclassing dict :

>>>
>>> class MyDict(dict):
...     def __setitem__(self, key, value):
...         super().__setitem__(key, None)
...
>>> my_dict = MyDict(first=1)
>>> my_dict
{'first': 1}
>>> my_dict['second'] = 2
>>> my_dict
{'first': 1, 'second': None}
>>> my_dict.setdefault('third', 3)
3
>>> my_dict
{'first': 1, 'second': None, 'third': 3}

In this example, you create MyDict , which is a class that subclasses dict . Your implementation of .__setitem__() always sets values to None . If you create an instance of MyDict and pass a keyword argument to its initializer, then you’ll notice the class is not calling your .__setitem__() to handle the assignment. You know that because the key first wasn’t assigned None .

By contrast, if you run a subscription operation like my_dict['second'] = 2 , then you’ll notice that second is set to None rather than to 2 . So, this time you can say that subscription operations call your custom .__setitem__() . Finally, notice that .setdefault() doesn’t call .__setitem__() either, because your third key ends up with a value of 3 .

UserDict doesn’t inherit from dict but simulates the behavior of a standard dictionary. The class has an internal dict instance called .data , which is used to store the content of the dictionary. UserDict is a more reliable class when it comes to creating custom mappings . If you use UserDict , then you’ll be avoiding the issues you saw before. To prove this, go back to the code for my_defaultdict and add the following method:

class my_defaultdict(collections.UserDict):
    # Snip
    def __setitem__(self, key, value):
        print('__setitem__() gets called')
        super().__setitem__(key, None)

Here, you add a custom .__setitem__() that prints a message and then calls the superclass .__setitem__() with None as the value, so every key ends up set to None . Update this code in your script my_dd.py and import it from an interactive session as follows:

>>>
>>> from my_dd import my_defaultdict
>>> my_dict = my_defaultdict(list, first=1)
__setitem__() gets called
>>> my_dict
{'first': None}
>>> my_dict['second'] = 2
__setitem__() gets called
>>> my_dict
{'first': None, 'second': None}

In this case, when you instantiate my_defaultdict and pass first to the class initializer, your custom __setitem__() gets called. Also, when you assign a value to the key second , __setitem__() gets called as well. You now have a my_defaultdict that consistently calls your custom special methods. Notice that all the values in the dictionary are equal to None now.
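
The same contrast shows up with .update() . The following sketch uses two hypothetical classes, NoneDict and NoneUserDict , that differ only in their base class:

```python
import collections

class NoneDict(dict):
    def __setitem__(self, key, value):
        super().__setitem__(key, None)

class NoneUserDict(collections.UserDict):
    def __setitem__(self, key, value):
        super().__setitem__(key, None)

built_in = NoneDict()
built_in.update(first=1)  # dict.update() bypasses the override

user = NoneUserDict()
user.update(first=1)      # UserDict.update() goes through __setitem__()

print(built_in)  # {'first': 1}
print(user)      # {'first': None}
```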

Passing Arguments to .default_factory

As you saw earlier, .default_factory must be set to a callable object that takes no arguments and returns a value. This value will be used to supply a default value for any missing key in the dictionary. Even though .default_factory can’t take arguments, Python offers some tricks that you can use if you need to supply arguments to it. In this section, you’ll cover two Python tools that can serve this purpose:

  1. lambda
  2. functools.partial()

With these two tools, you can add extra flexibility to the Python defaultdict type. For example, you can initialize a defaultdict with a callable that takes an argument and, after some processing, you can update the callable with a new argument to change the default value for the keys you’ll create from this point on.

Using lambda

A flexible way to pass arguments to .default_factory is to use lambda . Suppose you want to create a function to generate default values in a defaultdict . The function does some processing and returns a value, but you need to pass an argument for the function to work correctly. Here’s an example:

>>>
>>> from collections import defaultdict
>>> def factory(arg):
...     # Do some processing here...
...     result = arg.upper()
...     return result
...
>>> def_dict = defaultdict(lambda: factory('default value'))
>>> def_dict['missing']
'DEFAULT VALUE'

In the above code, you create a function called factory() . The function takes an argument, does some processing, and returns the final result. Then, you create a defaultdict and use lambda to pass the string 'default value' to factory() . When you try to get access to a missing key, the following steps are run:

  1. The dictionary def_dict calls its .default_factory , which holds a reference to a lambda function.
  2. The lambda function gets called and returns the value that results from calling factory() with 'default value' as an argument.

If you’re working with def_dict and suddenly need to change the argument to factory() , then you can do something like this:

>>>
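>>> def_dict.default_factory = lambda: factory('another default value')
>>> def_dict['another_missing']
'ANOTHER DEFAULT VALUE'

This time, you assign a new lambda function that calls factory() with a new string argument ( 'another default value' ). Note that you need to wrap the call in lambda again. If you assigned the result of factory('another default value') directly, then .default_factory would hold a string instead of a callable, and the next access to a missing key would fail with a TypeError . From now on, if you try to access or modify a missing key, then you’ll get a new default value, which is the string 'ANOTHER DEFAULT VALUE' .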

Finally, you might face a situation where you need a default value that’s different from 0 or [] . In this case, you can also use lambda to generate a different default value. For example, suppose you have a list of integer numbers, and you need to calculate, for each distinct number, the product of all its occurrences. Then, you can use a defaultdict along with lambda as follows:

>>>
>>> from collections import defaultdict
>>> lst = [1, 1, 2, 1, 2, 2, 3, 4, 3, 3, 4, 4]
>>> def_dict = defaultdict(lambda: 1)
>>> for number in lst:
...     def_dict[number] *= number
...
>>> def_dict
defaultdict(<function <lambda> at 0x...70>, {1: 1, 2: 8, 3: 27, 4: 64})

Here, you use lambda to supply a default value of 1 . With this initial value, you can calculate the cumulative product of each number in lst . Notice that you can’t get the same result using int because the default value returned by int is always 0 , which is not a good initial value for the multiplication operations you need to perform here.
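
The same approach works for non-numeric defaults. In this sketch, country_names is a hypothetical lookup table that falls back to a placeholder string for unknown codes:

```python
from collections import defaultdict

# Hypothetical lookup table: unknown codes fall back to a placeholder.
country_names = defaultdict(lambda: 'unknown', {'DE': 'Germany', 'JP': 'Japan'})

print(country_names['DE'])  # Germany
print(country_names['XX'])  # unknown
print(dict(country_names))  # {'DE': 'Germany', 'JP': 'Japan', 'XX': 'unknown'}
```

Notice that looking up 'XX' also stores it in the dictionary. If you only want to read a fallback value without adding a key, then .get() is the better tool.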

Using functools.partial()

functools.partial(func, *args, **keywords) is a function that returns a partial object. When you call this object, it behaves like func called with the positional arguments ( args ) and keyword arguments ( keywords ) that you supplied to partial() . You can take advantage of this behavior of partial() and use it to pass arguments to .default_factory in a Python defaultdict . Here’s an example:

>>>
>>> def factory(arg):
...     # Do some processing here...
...     result = arg.upper()
...     return result
...
>>> from collections import defaultdict
>>> from functools import partial
>>> def_dict = defaultdict(partial(factory, 'default value'))
>>> def_dict['missing']
'DEFAULT VALUE'
>>> def_dict.default_factory = partial(factory, 'another default value')
>>> def_dict['another_missing']
'ANOTHER DEFAULT VALUE'

Here, you create a Python defaultdict and use partial() to supply an argument to .default_factory . Notice that you can also update .default_factory to use another argument for the callable factory() . This kind of behavior can add a lot of flexibility to your defaultdict objects.
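
partial() can also freeze keyword arguments. The following sketch assumes a hypothetical make_record() factory that builds a small dictionary for each new key:

```python
from collections import defaultdict
from functools import partial

def make_record(status, retries=0):
    # Hypothetical factory: builds a fresh dict on every call.
    return {'status': status, 'retries': retries}

# Freeze one positional and one keyword argument.
jobs = defaultdict(partial(make_record, 'pending', retries=3))
jobs['job-1']['status'] = 'running'

print(jobs['job-1'])  # {'status': 'running', 'retries': 3}
print(jobs['job-2'])  # {'status': 'pending', 'retries': 3}
```

Because the partial object calls make_record() again for every missing key, each key gets its own dictionary, and changing one record doesn’t affect the others.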

Conclusion

The Python defaultdict type is a dictionary-like data structure provided by the Python standard library in a module called collections . The class inherits from dict , and its main added functionality is to supply default values for missing keys. In this tutorial, you’ve learned how to use the Python defaultdict type for handling the missing keys in a dictionary.

You’re now able to:

  • Create and use a Python defaultdict to handle missing keys
  • Solve real-world problems related to grouping, counting, and accumulating operations
  • Know the implementation differences between defaultdict and dict
  • Decide when and why to use a Python defaultdict rather than a standard dict

The Python defaultdict type is a convenient and efficient data structure that’s designed to help you out when you’re dealing with missing keys in a dictionary. Give it a try and make your code faster, more readable, and more Pythonic!

