Managing Context Metadata in Django-PGHistory: Solving the Persistence Problem
In this post you would learn how pghistory.context
works and some of its gotchas that we faced at Diversio(thanks to Amal Raj B R who identified this) and how we adjusted our approach to avoid saving context metadata that is not relevant to the current business logic but is being carried over from a parent context.
What is pghistory.context
?
pghistory.context
is a decorator and a function that is part of the fantastic django-pghistory
project. I would cover how we use django-pghistory
in a separate post.
When django-pghistory
is keeping track of changes to Django models, there are several instances where we want to include additional information with the tracked changes. This is where pghistory.context
comes into picture to easily capture such metadata.
We are making use of pghistory.middleware.HistoryMiddleware
to collect basic fields that we want to store with each change. This includes IP address, URL and the User who made the request. Some of this is done by the library itself (comments removed for brevity) as seen in the code below.
class HistoryMiddleware:
def __init__(self, get_response):
self.get_response = get_response
def get_context(self, request) -> Dict[str, Any]:
user = (
request.user._meta.pk.get_db_prep_value(request.user.pk, connection)
if hasattr(request, "user") and hasattr(request.user, "_meta")
else None
)
return {"user": user, "url": request.path}
def __call__(self, request):
if request.method in config.middleware_methods():
with pghistory.context(**self.get_context(request)):
if isinstance(request, DjangoWSGIRequest):
request.__class__ = WSGIRequest
elif isinstance(request, DjangoASGIRequest):
request.__class__ = ASGIRequest
return self.get_response(request)
else:
return self.get_response(request)
An interesting thing happening here is that the request-response is wrapped in the pghistory.context
context manager. That means all call to pghistory.context
anywhere in views would keep on accumulating the metadata that pghistory.context
is getting. This is a powerful pattern for audit tracking without having to manually pass context through your application code. Every database change automatically gets tagged with "who" (user) and "where" (URL) the change happened.
But this also has an unintended effect, let's use an example to explain it(gist)
In [1]: import pghistory
...:
...: with pghistory.context(url="/foo/bar", user="bugs-bunny") as pg:
...: print(f"Level 1-1: {pg.metadata}")
...:
...: with pghistory.context(foo="bar"):
...: print(f"Level 2-1: {pg.metadata}")
...: with pghistory.context(spam="eggs"):
...: print(f"Level 3-1: {pg.metadata}")
...:
...: with pghistory.context(hannah="montana"):
...: print(f"Level 3-2: {pg.metadata}")
...: with pghistory.context(timon="pumba"):
...: print(f"Level 4-1: {pg.metadata}")
...:
...: print(f"Level 1-2: {pg.metadata}")
...: with pghistory.context(monty="python"):
...: print(f"Level 2-2: {pg.metadata}")
...:
...:
...: with pghistory.context(guido="bdfl") as pg_2:
...: print(f"Level 1-4: {pg_2.metadata}")
...:
...: print(f"Level 1-5: {pg.metadata}")
...: print(f"Level 1-6: {pg_2.metadata}")
...:
Level 1-1: {'url': '/foo/bar', 'user': 'bugs-bunny'}
Level 2-1: {'url': '/foo/bar', 'user': 'bugs-bunny', 'foo': 'bar'}
Level 3-1: {'url': '/foo/bar', 'user': 'bugs-bunny', 'foo': 'bar', 'spam': 'eggs'}
Level 3-2: {'url': '/foo/bar', 'user': 'bugs-bunny', 'foo': 'bar', 'spam': 'eggs', 'hannah': 'montana'}
Level 4-1: {'url': '/foo/bar', 'user': 'bugs-bunny', 'foo': 'bar', 'spam': 'eggs', 'hannah': 'montana', 'timon': 'pumba'}
Level 1-2: {'url': '/foo/bar', 'user': 'bugs-bunny', 'foo': 'bar', 'spam': 'eggs', 'hannah': 'montana', 'timon': 'pumba'}
Level 2-2: {'url': '/foo/bar', 'user': 'bugs-bunny', 'foo': 'bar', 'spam': 'eggs', 'hannah': 'montana', 'timon': 'pumba', 'monty': 'python'}
Level 1-4: {'url': '/foo/bar', 'user': 'bugs-bunny', 'foo': 'bar', 'spam': 'eggs', 'hannah': 'montana', 'timon': 'pumba', 'monty': 'python', 'guido': 'bdfl'}
Level 1-5: {'url': '/foo/bar', 'user': 'bugs-bunny', 'foo': 'bar', 'spam': 'eggs', 'hannah': 'montana', 'timon': 'pumba', 'monty': 'python', 'guido': 'bdfl'}
Level 1-6: {'url': '/foo/bar', 'user': 'bugs-bunny', 'foo': 'bar', 'spam': 'eggs', 'hannah': 'montana', 'timon': 'pumba', 'monty': 'python', 'guido': 'bdfl'}
Here as we can see, regardless of which level of context manager we are in, or even if it's a completely new context manager, the keys keep on accumulating.
This is not something we want because there are multiple places in our business logic where we want to track different changes with different metadata instead of using the metadata from previous calls to pghistory.context
.
def my_view(request):
if request.method == "POST":
with pghistory.context(type="POST"):
# Perform some action
# save audit log
pass
elif request.method == "GET":
with pghistory.context(type="GET"):
# Perform some action
# save audit log
pass
# External API call
# Update an internal state
with pghistory.context(type="EXTERNAL"):
# Perform some action
# save audit log
pass
# Another action that updates the internal state
with pghistory.context(type="INTERNAL"):
# Perform some action
# save audit log
pass
Now for the above, the ideal behaviour we want is that every audit log should get the keys that were included in its context, but not the ones added by other contexts, apart from the common keys coming from the middleware.
Solution
We want a solution where we want the keys added by the middleware to persist, but we do not want the keys added inside the views to persist, and every time we exit from a context the keys should be removed from the metadata.
For this the solution was to override the default implementation of pghistory.context
. Let's say the custom implementation is called custom_pg_context
, then we want the output of the program we had earlier on to be:
The key change we made was introducing a new parameter called persist
. When set to True
(the default), metadata keys will persist across nested contexts and remain available to other contexts. When set to False
, metadata keys added in the current context will automatically be removed upon exiting the context, ensuring they don't leak into subsequent operations.
Here only want the user
and url
to persist (later on timon
as well)
Level 1-1: {'url': '/foo/bar', 'user': 'bugs-bunny'}
Level 2-1: {'url': '/foo/bar', 'user': 'bugs-bunny', 'foo': 'bar'}
Level 3-1: {'url': '/foo/bar', 'user': 'bugs-bunny', 'foo': 'bar', 'spam': 'eggs'}
# Notice the lack of 'spam': 'eggs' in 3-2
Level 3-2: {'url': '/foo/bar', 'user': 'bugs-bunny', 'foo': 'bar', 'hannah': 'montana'}
Level 4-1: {'url': '/foo/bar', 'user': 'bugs-bunny', 'foo': 'bar', 'hannah': 'montana', 'timon': 'pumba'}
# Everything has been removed when we hit 1-2 except for url and user
# and timon, because these we want to persist.
Level 1-2: {'url': '/foo/bar', 'user': 'bugs-bunny', 'timon': 'pumba'}
Level 2-2: {'url': '/foo/bar', 'user': 'bugs-bunny', 'timon': 'pumba', 'monty': 'python'}
Level 1-4: {'url': '/foo/bar', 'user': 'bugs-bunny', 'timon': 'pumba', 'guido': 'bdfl'}
Level 1-5: {'url': '/foo/bar', 'user': 'bugs-bunny', 'timon': 'pumba', 'guido': 'bdfl'}
Level 1-6: {'url': '/foo/bar', 'user': 'bugs-bunny', 'timon': 'pumba', 'guido': 'bdfl'}
Now this is what the code looks like now(gist):
from __future__ import annotations
from contextvars import ContextVar
from copy import deepcopy
import pghistory
context_stack: ContextVar[list[custom_pg_context]] = ContextVar(
"context_stack", default=[]
)
class custom_pg_context(pghistory.context):
def __init__(self, *, persist: bool = True, **metadata):
self._persist = persist
self._local_metadata = deepcopy(metadata)
self.token = context_stack.set(context_stack.get() + [self])
super().__init__(**metadata)
def __enter__(self):
return_value = super().__enter__()
self._tracker_value = return_value
return return_value
def __exit__(self, exc_type, exc_val, exc_tb):
try:
if not self._persist:
# Removing keys added in this context from parent contexts
stack = context_stack.get()[:-1]
for parent in stack:
if parent._tracker_value:
for key in self._local_metadata:
parent._tracker_value.metadata.pop(key, None)
return super().__exit__(exc_type, exc_val, exc_tb)
finally:
current_stack = context_stack.get()
if current_stack and current_stack[-1] is self:
context_stack.set(current_stack[:-1])
@staticmethod
def get_parent_context() -> custom_pg_context | None:
stack = context_stack.get()
return stack[-2] if len(stack) > 1 else None
with custom_pg_context(url="/foo/bar", user="bugs-bunny", persist=True) as pg:
print(f"Level 1-1: {pg.metadata}")
with custom_pg_context(foo="bar", persist=False):
print(f"Level 2-1: {pg.metadata}")
with custom_pg_context(spam="eggs", persist=False):
print(f"Level 3-1: {pg.metadata}")
with custom_pg_context(hannah="montana", persist=False):
print(f"Level 3-2: {pg.metadata}")
with custom_pg_context(timon="pumba", persist=True):
print(f"Level 4-1: {pg.metadata}")
print(f"Level 1-2: {pg.metadata}")
with custom_pg_context(monty="python", persist=False):
print(f"Level 2-2: {pg.metadata}")
with custom_pg_context(guido="bdfl") as pg_2:
print(f"Level 1-4: {pg_2.metadata}")
print(f"Level 1-5: {pg.metadata}")
print(f"Level 1-6: {pg_2.metadata}")
What's happening here?
- We are making use of
ContextVar
from thecontextvars
module to maintain a stack of all contexts and identify parents and children of any particular context we are in. - We create a custom class
custom_pg_context
that inherits frompghistory.context
, with an additionalpersist
parameter that defaults toTrue
. - In the constructor, we keep a copy of the local metadata and add ourselves to the context stack.
- During
__enter__
, we store the tracker object returned by the parent class for later use. -
The magic happens in
__exit__
:- If
persist=False
, we look through all parent contexts in the stack and remove any keys from our local metadata. - This ensures that temporary metadata isn't carried forward to other contexts.
- Keys with
persist=True
(like ouruser
,url
, andtimon
) remain in the metadata across all contexts.
- If
-
We also maintain proper stack management by removing our context from the stack when exiting.
- The
get_parent_context()
static method provides a way to access the parent context if needed.
This implementation solves our problem by allowing us to control which metadata persists across different parts of our application. We can use persist=True
for global tracking (like user and URL from middleware) and persist=False
for local, context-specific metadata that shouldn't leak into other operations.
Using the Custom Context in Django
To implement this in a Django project, you would:
- Create a new module (e.g.,
core/context.py
) with thecustom_pg_context
implementation. - Update your middleware to use this custom context manager instead of the default one:
from core.context import custom_pg_context
class CustomHistoryMiddleware(HistoryMiddleware):
def __call__(self, request):
if request.method in config.middleware_methods():
# Use our custom context manager with persist=True
with custom_pg_context(persist=True, **self.get_context(request)):
if isinstance(request, DjangoWSGIRequest):
request.__class__ = WSGIRequest
elif isinstance(request, DjangoASGIRequest):
request.__class__ = ASGIRequest
return self.get_response(request)
else:
return self.get_response(request)
- In your views and services, use the custom context manager with
persist=False
for operation-specific metadata:
from core.context import custom_pg_context
def my_view(request):
# Use persist=False for view-specific metadata
with custom_pg_context(operation="view_details", persist=False):
# Perform database operations
item.save() # This will include the operation metadata
# Start a new context without leaking previous metadata
with custom_pg_context(operation="update_related", persist=False):
# Perform more operations
related_item.save() # Only includes this operation's metadata
Performance Considerations
One thing to note is that this approach does add some overhead, especially with deeply nested contexts. For most applications, this overhead is negligible compared to database operations, but it's something to keep in mind for performance-critical code paths.
If you're concerned about performance, you could optimize the implementation further:
- Use a more efficient data structure for the context stack
- Limit the depth of context nesting
- Consider using a profiler to identify bottlenecks in your specific use case
Conclusion
By extending pghistory.context
with our custom implementation, we've solved the metadata leakage problem while maintaining the convenience of the context-based tracking system. This approach gives us fine-grained control over which metadata persists across different operations, making our audit logs more accurate and meaningful.
In a real-world Django application, this means we can add specific tracking metadata to different operations without polluting the audit trail of unrelated operations, while still maintaining the global context from middleware.
Comments