Logging messages onto a strictly bounded queue is perilous
Imagine we have a complicated program with a logging framework in it which puts messages on a queue. One thread is responsible for taking log messages from the queue and writing them somewhere, such as a logstash instance or local database.
Ideally we don't want to ever drop messages on the floor, even if the threads that add log messages go faster than the logging thread can save the messages. We can't buffer messages forever or we'll eventually just run out of memory and crash, so the only way out that never loses messages is to force the threads generating the messages to slow down if they outrace the logging thread too much.
The obvious way to do this is with a blocking queue. Unfortunately, this is perilous in any system in which finalisers or signals can cause messages to be logged from any thread.
The setup:
Say we have the following program, which simulates the system described above:
try: import queue except ImportError: import Queue as queue from threading
import Thread, current_thread import time import os import sys def usage():
print("Usage: %s <MAX_SIZE>" % sys.argv[0]) print(" where MAX_SIZE is an
integer >= 0.") print("") print("If MAX_SIZE is 0, model an unbounded log
message queue,") print("otherwise model a queue holding up to MAX_SIZE
messages.") sys.exit(1) if not len(sys.argv) == 2: usage() MAX_SIZE =
int(sys.argv[1], base=10) if MAX_SIZE < 0: usage() message_queue =
queue.Queue(maxsize=MAX_SIZE) class HLD(object): def __del__(self): message =
u"%r destroyed %r\n" % (current_thread().name, self) os.write(1,
message.encode("utf-8")) message_queue.put(message) def __repr__(self): return
'HLD(%r)' % (id(self),) def in_read_thread(): while True: item =
message_queue.get() message = u"reader dequeued (%r)\n" % (item,) os.write(1,
message.encode("utf-8")) # This block allocates cyclic garbage in order to #
make the gc sometimes activate on the read thread. # For a faster repro,
replace this block with: # import gc; gc.collect() for _ in range(2): trash =
[object() for _ in range(100)] trash.append(trash) trash = None def
in_write_thread(): while True: # This block allocates cyclic garbage which has
references to HLD # objects. The HLD objects themselves aren't part of the
cycles, # so this cyclic garbage can be collected at some point. trash =
[HLD() for _ in range(10)] trash.append(trash) trash = None time.sleep(0.1)
message_queue.put(u"Hello from write thread!") read_thread =
Thread(target=in_read_thread) read_thread.name = "read_thread"
read_thread.daemon = True read_thread.start() write_thread =
Thread(target=in_write_thread) write_thread.name = "write_thread"
write_thread.daemon = True write_thread.start() while True: # Pause the main
thread indefinitely (until the program is interrupted # by pressing control-C
at the terminal). Use time.sleep() for this # rather than a lock or
Thread.join(), because Python 2 has a bug # wherein signals such as
KeyboardInterrupt can't be delivered # when the main thread is block on any
kind of mutex. # This bug appears to have been fixed in Python 3.
time.sleep(100)
If you run this program with the argument 0, it will simulate a system with an unbounded queue for log messages. If you run it with any positive argument, it will simulate a system with a strictly bounded queue for log messages.
Unfortunately, when the queue is bounded, it manifests a deadlock bug.
The HLD
class models a type of object which sometimes logs a
message from its finaliser (the __del__
method). This can happen
in practical frameworks where, for example, exceptions get raised from buggy
code in finalisers. We need to log these so we can find out about and fix the
bugs.
The read_thread
executes the function
in_read_thread()
. It accepts log messages from the log queue,
saves them somewhere (in this case, stdout) and in the process allocates some
garbage. The deliberate garbage allocations in this code are a simulation of
doing something slightly complicated with the messages, like formatting them
to put into a database through an ORM.
The write_thread
executes the function
in_write_thread()
. It writes log messages to the log queue and
allocates garbage, simulating the program doing real work and responding to
user requests. Some of the garbage it allocates has finalisers that will emit
log messages.
The punchline:
When the garbage collector kicks in, some HLD
objects will be
destroyed, have their finalisers run, and emit log messages.
If the garbage collector kicks in on the write_thread
then
nothing goes wrong; if there are too many of them all at once for the
read_thread
to deal with, the write_thread
will
simply slow down and wait for the read_thread
.
In the output from our simulation, this will tend to look like:
'write_thread' destroyed HLD(140364268959120) 'write_thread' destroyed HLD(140364268959056) reader dequeued (u"'write_thread' destroyed HLD(140364268958032)\n") 'write_thread' destroyed HLD(140364268958992) reader dequeued (u"'write_thread' destroyed HLD(140364268957968)\n") 'write_thread' destroyed HLD(140364268958928) reader dequeued (u"'write_thread' destroyed HLD(140364268957904)\n") 'write_thread' destroyed HLD(140364268958864) reader dequeued (u"'write_thread' destroyed HLD(140364268959120)\n") 'write_thread' destroyed HLD(140364268958800) reader dequeued (u"'write_thread' destroyed HLD(140364268959056)\n")
If the garbage collector kicks in on the read_thread
instead,
then the system can deadlock when the queue is bounded. When the finaliser for
a HLD
object is run on the read_thread
, it will try
to log a message to the queue. The queue might, at this point, be full. If it
is, then the reader thread will block in the HLD
object's
finaliser. It can't ever be unblocked, because it can't return from the
HLD
object's finaliser in order to drain the logging queue to
make room for the message it is currently waiting for a change to add!
In the output from our simulation, (with MAX_SIZE==4
) this looks
like:
reader dequeued (u'Hello from write thread!') reader dequeued (u'Hello from write thread!') reader dequeued (u'Hello from write thread!') 'read_thread' destroyed HLD(140364269227472) 'read_thread' destroyed HLD(140364269227408) 'read_thread' destroyed HLD(140364269227344) 'read_thread' destroyed HLD(140364269227280) 'read_thread' destroyed HLD(140364269227216)
And there it freezes, deadlocked.
Tragic!
The respite:
One way out of this conundrum is to move the message queue out of the process, into a separate process which does nothing but drain the queue and does not ever log messages itself.
Another possibility is to change the method for adding messages to the queue
to refrain from blocking if called from the read_thread
. However,
we're now back to worrying about how we can't buffer messages forever. If we
go down this route, we'll either have to drop messages on the floor, or hope
really hard that any messages that happen to get added by the logging
thread itself don't happen to be large enough to exhaust available memory.
This may be entirely practical; many programs don't ever allocate very many
objects with finalisers.
A less practical idea is to make sure that signals and finalisers can't cause
messages to be logged from the read_thread
. In Python, it's easy
enough to arrange for signals never to arrive on the thread which is draining
the queue - simply make sure that the log message queue is drained from some
thread that is not the 'main' thread, because Python only delivers signals to
the 'main' thread. However, avoiding logging messages from finalisers is not
so practical. If we have finalisers at all then eventually we'll
probably find bugs in them, and if our logging system doesn't help us get the
exception messages and tracebacks so we can try to fix them, then the logging
system isn't actually doing us any good.
One idea which will not work is simply adding more than one logging thread, in the hope that the second thread will always unblock the first. The big problem is that two (or more!) readers can both block if they each log a message at the wrong time. A smaller issue is that, if having more than one thread doing the actual writing-down of messages isn't a problem for, why use a queue at all? We could just change the logging mechanism to write down the messages directly without the queue.
© Copyright 2024 Richard Barrell