Closures and Partial Function Application in Python: An NLP Use Case
While creating some pipelines for automatic text annotation, I encountered a bug that made me realize I didn’t fully understand how closures work in Python. It’s important to note that in Python, a for-loop does not create a new scope or its own context, which can affect how closures behave.
When you’re working with multiple packages in Python, you’re essentially dealing with a complex ecosystem of code. You might often find that modifying components—be they functions, classes, or modules—is either impractical or impossible. This is especially true when those components are part of packages that are not under your control. Here’s where closures can provide value:
- Data Encapsulation: Closures can enclose state: variables from the outer function that the inner function relies on. This encapsulation can effectively isolate this pocket of state, minimizing the risk of unintended side-effects when integrating with external packages.
- Idiomatic Code: Pythonic idioms encourage readability and simplicity. Using closures can be a more Pythonic way to achieve specific kinds of encapsulation and state management without resorting to creating full-blown classes.
- Reduced Mental Overhead: When you’re wrestling with a complex system, every bit of simplification helps. Closures can help you encapsulate specific behaviors and states into individual, manageable units without requiring you to understand or modify the complete architecture of an external package.
By focusing on these benefits, closures can sometimes serve as a more straightforward, clean alternative to complex inheritance hierarchies or class compositions when dealing with multiple external packages.
In this article, I’ll demonstrate the concept by reconstructing simplified versions of functionality from SpaCy and skweak. By the end, you should have a solid grasp of when and why to use closures, particularly in the context of Natural Language Processing (NLP).
What Are Closures?
Before we dive into the problem I encountered, let’s briefly talk about what closures are. A closure in Python is a function object that retains access to variables from the scope in which it was defined, even after that enclosing function has finished executing. This allows data to be hidden from the global scope, making it possible to encapsulate logic and state within a function.
Here’s a simple example:
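A minimal sketch of such a closure follows. The names `make_adder` and `add` are hypothetical, chosen only for illustration.

```python
def make_adder(n):
    # `n` is a local variable of make_adder.
    def add(x):
        # `add` closes over `n` and can still read it
        # after make_adder has returned.
        return x + n

    return add


add_five = make_adder(5)
print(add_five(3))  # 8
```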
8
SpaCy objects
SpaCy is here to help you build NLP pipelines easily.
Central to this package are the Doc objects (short for document). A Doc is a neat way to pack data for NLP, and if it doesn’t provide what you need out of the box, you can always extend its functionality to match your use case.
By implementing the __len__ and __getitem__ double-underscore (dunder) methods, we get the ability to iterate through the Doc’s tokens with a simple for loop, as below. This is thanks to the Python data model. It’s outside the scope of this post, but learning to leverage the data model will pay dividends in your effectiveness in Python. Fluent Python introduces it in the first chapter in a very neat way. If you prefer video, James Powell has you covered.
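To make the role of the two methods concrete, here is a toy `Doc`, assuming naive whitespace tokenization and a plain `spans` list. It is a stand-in for illustration, not spaCy's actual class.

```python
class Doc:
    """Toy document: a list of tokens plus a list of attached spans."""

    def __init__(self, text):
        self.tokens = text.split()  # assumption: split on whitespace
        self.spans = []

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, index):
        # Raising IndexError past the end (which the list does for us)
        # is what lets a plain `for` loop know when to stop.
        return self.tokens[index]


doc = Doc("Today I ate garbonzo beans")
for token in doc:
    print(token)
```

Because `__getitem__` is defined, Python falls back to the sequence protocol and calls it with 0, 1, 2, ... until an `IndexError` is raised; no `__iter__` is needed.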
Today
I
ate
garbonzo
beans
A Span is a slice of a Doc. Usually it can... span multiple tokens, but today I have a feeling that all the spans we’ll look at will match exactly one token. Also, in our case the spans will always be labeled.
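A labeled single-token span can be sketched as a small dataclass. The field names `position` and `label` come from the printed output in this post; using `dataclass` to get that representation is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Span:
    position: int  # index of the single token this span covers
    label: str


span = Span(position=1, label="ANIMAL")
print(span)  # Span(position=1, label='ANIMAL')
```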
skweak functions
If you haven’t looked at the skweak repo yet, it suffices to know that it provides a neat way of composing simple annotators to get a better one.
Now, skweak provides us with some very interesting classes. One is FunctionAnnotator. It takes a function that returns a list of spans from a document and attaches these spans to the given document.
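The behavior described above can be sketched roughly as follows. This is a toy reimplementation under stated assumptions, not skweak's real class: the toy `Doc` and `Span`, the `__call__` interface, and the `animal_labeling_function` heuristic are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    position: int
    label: str


class Doc:
    def __init__(self, text):
        self.tokens = text.split()
        self.spans = []

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, index):
        return self.tokens[index]


class FunctionAnnotator:
    """Wraps a function that maps a Doc to a list of Spans."""

    def __init__(self, function):
        self.function = function

    def __call__(self, doc):
        # Attach whatever spans the wrapped function finds to the document.
        doc.spans.extend(self.function(doc))
        return doc


def animal_labeling_function(doc):
    # Hypothetical heuristic: label tokens found in a tiny lexicon.
    animals = {"cat", "mouse"}
    return [Span(i, "ANIMAL") for i, token in enumerate(doc) if token in animals]


doc = FunctionAnnotator(animal_labeling_function)(Doc("The cat chased a mouse"))
print(doc.spans)
# [Span(position=1, label='ANIMAL'), Span(position=4, label='ANIMAL')]
```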
Let’s see a simple labeling function we may use:
[Span(position=1, label='ANIMAL'), Span(position=6, label='ANIMAL')]
The FunctionAnnotatorAggregator takes multiple annotator functions and combines them in some fancy way. We won’t do it justice with the implementation below.
We’ll just make it force our documents to have a maximum of one label per span. We will also sort them by the order of appearance.
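One way to sketch this simplified aggregator is shown below. The post does not say which label wins when two functions label the same position, so "first function wins" is an assumption here. To keep the block short, a plain list of tokens stands in for the Doc and the spans are returned rather than attached; the three labeling functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    position: int
    label: str


class FunctionAnnotatorAggregator:
    """Toy aggregator: at most one span per position, sorted by position."""

    def __init__(self, functions):
        self.functions = list(functions)

    def __call__(self, tokens):
        by_position = {}
        for function in self.functions:
            for span in function(tokens):
                # Assumption: the first function to label a position wins.
                by_position.setdefault(span.position, span)
        return [by_position[position] for position in sorted(by_position)]


def animals(tokens):
    return [Span(i, "ANIMAL") for i, t in enumerate(tokens) if t in {"cat", "mouse"}]


def verbs(tokens):
    return [Span(i, "VERB") for i, t in enumerate(tokens) if t == "chased"]


def pets(tokens):
    # Conflicts with `animals` on "cat"; it loses because it runs later.
    return [Span(i, "PET") for i, t in enumerate(tokens) if t == "cat"]


aggregate = FunctionAnnotatorAggregator([animals, verbs, pets])
result = aggregate("The cat chased a mouse".split())
print(result)
# [Span(position=1, label='ANIMAL'), Span(position=2, label='VERB'), Span(position=4, label='ANIMAL')]
```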
[Span(position=1, label='ANIMAL'), Span(position=2, label='VERB'), Span(position=6, label='ANIMAL')]
The problem
The packages are well implemented and work as expected! Now, we may wish to programmatically generate some labeling functions from a list of excellent heuristic parameters:
[Span(position=6, label='BOVINE')]
What happened? It seems that only the last function was applied. Let’s look at the labeling_functions:
<function labeling_function at 0x10372e200>
<function labeling_function at 0x10372f1c0>
<function labeling_function at 0x10372f250>
They point to different memory addresses. Let’s rewrite this with lambda functions.
Note: if you haven’t worked with list comprehensions before, don’t worry about it. Think of the code below as a way to create several new functions without each one replacing an existing function of the same name.
<function <listcomp>.<lambda> at 0x10372dc60>
<function <listcomp>.<lambda> at 0x10372f490>
<function <listcomp>.<lambda> at 0x10372f5b0>
But when we print the result of applying the functions, the problem remains:
[Span(position=6, label='BIRD')]
This is because of scoping. Since we didn’t define starts_with and label in the lambda body or parameters, each lambda looks them up in the enclosing scope at the moment it is called, and there it finds the last values that starts_with and label had.
If you come from other languages this might seem strange, but Python doesn’t create a new scope or context inside the for body. Instead, it uses the same local scope. This is why starts_with, label = 'B', 'BOVINE' in snippet 8 caused snippet 9 to display the label as ‘BOVINE’.
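The effect can be reproduced in isolation with a stripped-down sketch. The parameter list and the token-level `labeling_function` below are hypothetical simplifications; no Doc or Span is involved.

```python
parameters = [("M", "MAMMAL"), ("F", "FISH"), ("B", "BIRD")]  # hypothetical heuristics

labeling_functions = []
for starts_with, label in parameters:

    def labeling_function(token):
        # `starts_with` and `label` are looked up when this function is
        # CALLED, not when it is defined.
        return label if token.startswith(starts_with) else None

    labeling_functions.append(labeling_function)

# The loop variables outlive the loop and keep their last values.
print(starts_with, label)  # B BIRD

# Three distinct function objects, yet all of them see 'B' and 'BIRD'.
print([f("Fish") for f in labeling_functions])  # [None, None, None]
print([f("Bison") for f in labeling_functions])  # ['BIRD', 'BIRD', 'BIRD']
```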
But be not afraid! There is a solution:
Now, when we run the annotators, things go as expected.
[Span(position=0, label='MAMMAL'), Span(position=3, label='FISH'), Span(position=6, label='BIRD')]
But why is this different from the last attempt? This time, by calling function_closure, we create a local scope around the labeling function and put the starts_with and label variables in it. These variables are recreated every time we call function_closure. The call also recreates labeling_function, since functions are regular objects in Python, and different calls can’t trample on one another’s local variables.
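A sketch of this factory pattern is given below. The names `function_closure`, `labeling_function`, `local_starts_with` and `local_label` follow the post; the parameter list, the sample tokens, and the use of a plain token list in place of a Doc are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Span:
    position: int
    label: str


def function_closure(starts_with, label):
    # Fresh local names, created anew on every call to function_closure.
    local_starts_with, local_label = starts_with, label

    def labeling_function(tokens):
        return [
            Span(i, local_label)
            for i, token in enumerate(tokens)
            if token.startswith(local_starts_with)
        ]

    return labeling_function


parameters = [("M", "MAMMAL"), ("F", "FISH"), ("B", "BIRD")]  # hypothetical
labeling_functions = [function_closure(prefix, name) for prefix, name in parameters]

tokens = ["Moose", "and", "Flounder", "met", "Bluejay"]
spans = [span for f in labeling_functions for span in f(tokens)]
print(spans)
# [Span(position=0, label='MAMMAL'), Span(position=2, label='FISH'), Span(position=4, label='BIRD')]
```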
A good mental model is to think of function values as containing both the code in their body and the environment in which they are created.*
*Lifted from the book Eloquent JavaScript.
Inspecting the functions will also confirm this:
<function function_closure.<locals>.labeling_function at 0x10372f640>
<class 'function'>
One improvement we can make is to drop local_starts_with and local_label, since parameters are themselves local variables:
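A possible shape for the simplified factory is sketched here, together with a small hypothetical helper, `captured`, that reads the captured values back out of the closure cells. The token-level `labeling_function` is a simplification for brevity.

```python
def function_closure(starts_with, label):
    # The parameters are already local to this call, so the inner
    # function can close over them directly.
    def labeling_function(token):
        return label if token.startswith(starts_with) else None

    return labeling_function


def captured(function):
    # Pair each free variable name with the value stored in its cell.
    names = function.__code__.co_freevars
    values = (cell.cell_contents for cell in function.__closure__)
    return dict(zip(names, values))


bird = function_closure("B", "BIRD")
fish = function_closure("F", "FISH")

print(captured(bird) == {"starts_with": "B", "label": "BIRD"})  # True
print(captured(fish) == {"starts_with": "F", "label": "FISH"})  # True
```

Each call to `function_closure` produces its own pair of cells, which is why `bird` and `fish` do not interfere with each other.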
[Span(position=0, label='MAMMAL'), Span(position=3, label='FISH'), Span(position=6, label='BIRD')]
Partial function application
Yet another way to do things, partial application is a concept where you fix a few arguments of a function and generate a new function:
[Span(position=0, label='MAMMAL'), Span(position=3, label='FISH'), Span(position=6, label='BIRD')]
Explanation:
- labeling_function is now a single function that accepts three parameters.
- We use functools.partial to “lock in” the first two parameters, starts_with and label.
- This generates a new function for each pair of starts_with and label, which we then add to labeling_functions.
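The three points above can be sketched as follows. `functools.partial` is the standard-library function; the parameter list, the sample tokens, and the plain token list standing in for a Doc are assumptions.

```python
from dataclasses import dataclass
from functools import partial


@dataclass
class Span:
    position: int
    label: str


def labeling_function(starts_with, label, tokens):
    # One ordinary function with three parameters; nothing is captured.
    return [
        Span(i, label) for i, token in enumerate(tokens) if token.startswith(starts_with)
    ]


parameters = [("M", "MAMMAL"), ("F", "FISH"), ("B", "BIRD")]  # hypothetical
# partial() stores the first two arguments at creation time, so each
# object keeps its own pair regardless of what the loop variables do later.
labeling_functions = [partial(labeling_function, prefix, name) for prefix, name in parameters]

tokens = ["Moose", "and", "Flounder", "met", "Bluejay"]
spans = [span for f in labeling_functions for span in f(tokens)]
print(spans)
# [Span(position=0, label='MAMMAL'), Span(position=2, label='FISH'), Span(position=4, label='BIRD')]
print(labeling_functions[0].args)  # ('M', 'MAMMAL')
```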
Now compare it with how you’d implement a regular class:
[Span(position=0, label='MAMMAL'), Span(position=3, label='FISH'), Span(position=6, label='BIRD')]
The object is used only because it can act as a regular function, something that an actual function is better suited to do. This approach also requires you to come up with a naming convention for this kind of class. And it doesn’t fit with the fact that skweak expects a function (as the code and docstrings imply), even if the object masquerades as one.
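For comparison, a class-based version might look like the sketch below. The class name `LabelingCallable` appears in this post's output; implementing it through `__call__`, the parameter list, and the sample tokens are assumptions.

```python
from dataclasses import dataclass
from types import FunctionType


@dataclass
class Span:
    position: int
    label: str


class LabelingCallable:
    def __init__(self, starts_with, label):
        # State lives on the instance instead of in a closure.
        self.starts_with = starts_with
        self.label = label

    def __call__(self, tokens):
        return [
            Span(i, self.label)
            for i, token in enumerate(tokens)
            if token.startswith(self.starts_with)
        ]


parameters = [("M", "MAMMAL"), ("F", "FISH"), ("B", "BIRD")]  # hypothetical
labeling_functions = [LabelingCallable(prefix, name) for prefix, name in parameters]

tokens = ["Moose", "and", "Flounder", "met", "Bluejay"]
spans = [span for f in labeling_functions for span in f(tokens)]
print(spans)
# [Span(position=0, label='MAMMAL'), Span(position=2, label='FISH'), Span(position=4, label='BIRD')]

# The instance is callable, but it is not a function object.
print(callable(labeling_functions[0]))  # True
print(isinstance(labeling_functions[0], FunctionType))  # False
```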
You can also do something like this:
[Span(position=0, label='MAMMAL'), Span(position=3, label='FISH'), Span(position=6, label='BIRD')]
<bound method LabelingCallable.labeling_function of <__main__.LabelingCallable object at 0x1037524a0>>
<class 'method'>
Maybe this is something you’ll eventually want but, for our intended purposes, this is basically a closure with extra steps.
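The bound-method variant could be sketched like this. The class and method names match the printed output in this post; the token-level method body and the parameter list are hypothetical simplifications.

```python
class LabelingCallable:
    def __init__(self, starts_with, label):
        self.starts_with = starts_with
        self.label = label

    def labeling_function(self, token):
        return self.label if token.startswith(self.starts_with) else None


parameters = [("M", "MAMMAL"), ("F", "FISH"), ("B", "BIRD")]  # hypothetical
# Each bound method carries its own instance (and therefore its own state).
labeling_functions = [
    LabelingCallable(prefix, name).labeling_function for prefix, name in parameters
]

print([f("Flounder") for f in labeling_functions])  # [None, 'FISH', None]
print(type(labeling_functions[0]))  # <class 'method'>
print(labeling_functions[1].__self__.label)  # FISH
```

Here the instance referenced by `__self__` plays the role that the enclosing scope plays for a closure.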
Conclusion
In this post we explored closures, a solution for providing data locality.
As we’ve seen, Python closures and partial function application are powerful features for encapsulating local state. These techniques can be especially useful in NLP pipelines, allowing us to write clean, modular code. Whether you are working on a simple task or something complex like automated text annotation, understanding these language features can significantly improve your code quality and maintainability.
If this is the kind of content you enjoy, let me know!
Author Bogdan
LastMod 2021-11-06