Hunchentoot: acceptor code review
This post is part of a series on Common Lisp WWWism, more specifically of ongoing work to understand the web server known as Hunchentoot. Even more specifically, this post exhaustively covers each of the methods and functions implemented by the acceptor class.
Let's begin with my own version of (ac)counting the functions in acceptor.lisp. First, we read all the top-level S-expressions in the file:
(defparameter *top-level-sexprs*
(with-open-file (in "acceptor.lisp")
(loop for sexpr = (read in nil :err)
until (eq sexpr :err)
collect sexpr)))
and after fiddling around with fixing read-time weird encounters, we keep expressions that are lists beginning with "defmethod" or "defun":
(defparameter *funcs*
(loop for def in *top-level-sexprs*
with acc = nil do
(when (and (consp def) (member (car def) '(defmethod defun)))
(push (list (car def) (cadr def)) acc))
finally (return (nreverse acc))))
Then we:
> (length *funcs*)
27
which is weird, because counting them by hand (for the purpose of e.g. determining the number of lines occupied by each function) yields no less than 30; which actually makes sense, because some of the definitions have a "#+lispworks" in front of them, which causes the CL reader to ignore them. So the first observation is that the Hunchentoot coad is full of ifdefism of the lowest kind. And since I don't have a Lispworks and I'm not planning to support it, my branch of the Hunchentoot V tree will be devoid of any "#+lispworks", and ultimately of any "#+..." whatsoever.
The second observation is that most methods -- notable exceptions being process-connection and acceptor-status-message -- are actually quite small and easy to read; most of the gnarl lies in heavy "systems code" that has to do corner-case handling for some particularity of e.g. the HTTP protocol, and in some cases even that could be refactored into something sane. And now without further ado, we dive straight into the function/method definitions. Note the organization: <anchor> <function name with link to source code>: <description>, which should help the reader to more easily navigate this maze of references.
[ddd] default-document-directory: Function for finding the "default document root", i.e. the "www/" directory in the Hunchentoot coad base. This directory contains a test page and the pre-generated Hunchentoot documentation (otherwise found in "doc"). I'm not convinced this is really needed, other than perhaps for a quick test of new Hunchentoot installations. Then again, any other filesystem path that the operator desires could be set as a document root for that purpose.
[po] print-object: Implementation for the printing method, called when e.g. "print" is called on an acceptor object.
[ii] initialize-instance: method for initializing acceptor objects, called when "make-instance" is called. Note the ":after" method qualifier, specifying that this implementation of initialize-instance is called after that of the superclass.
This defmethod: a. sets the persistent-connections-p field to nil for single-threaded taskmasters; and b. sets the document root and error template directory to their so-called physical pathname. In particular (b) allows using so-called "logical pathnames", which I'll admit I never used, but which supposedly aim to implement a sort of pseudo-GNS for refering to files using filesystem-agnostic resource locators.
[s] start: a. sets "self" acceptor as the acceptor of its current taskmaster (why isn't this being done in initialize-instance?); b. calls start-listening; c. calls the taskmaster's execute-acceptor.
[s2] stop: a. sets acceptor-shutdown-p to true; b. calls wake-acceptor-for-shutdown; c. when "soft" is set to true, waits on the shutdown-queue condition variable for the next outstanding request to be served; d. calls the taskmaster's shutdown method; e. closes and loses the reference to the acceptor's listening socket, thus allowing it to be garbage collected.
Note on (c): while I'm not sure this "soft shutdown" mechanism is of much use at all1, we observe that it escapes the problem of indefinite postponment by waiting on only the first in-progress request to be served. That is, if the server wanted to play really nicely, it would wait on that condition variable until requests-in-progress went down to zero; however, in this case some really mean fucker who does not want to let us finish could have generated requests ad infinitum, or in any case, until the next power outage. Fortunately, whoever designed this wasn't that naive.
[wafs] wake-acceptor-for-shutdown: "Creates a dummy connection to the acceptor, waking accept-connections while it is waiting. This is supposed to force a check of acceptor-shutdown-p." And you have to admit, that's a pretty weird way of forcing the check too, but it works; anyway, a much simpler approach than using CFFI and epoll.
[ics] initialize-connection-stream: By default it returns the stream object as it was passed to it (and naught else), but sub-acceptors can override/extend this behaviour. Off the top of my head I'm not sure why I'd need to implement a sub-method of this myself, for now it's just that the "SSL acceptor" (which is well on the way to extinction) uses it to do SSListic stuff.
[rcs] reset-connection-stream: "Resets the stream which is used to communicate between client and server after one request has been served so that it can be used to process the next request." We're not told what this "reset" means more exactly, so we're stuck trying to reverse it from the context. Note that similarly to the previous method, this can be extended by users to perform various actions on streams; only it's not called at the very beginning of a connection, but after each request.
Our reset-connection-stream function is called from process-connection after processing the current request and flushing the stream (why is "flushing" a separate operation from "resetting" anyway? no idea). One would expect thusly that "resetting" would have something to do with "request processing", only there's no apparent relation between the two: the method implementation only turns the connection stream from a "chunked stream" (assuming chunking2 is enabled, which it is, by default) to a "non-chunked" (regular) socket stream. I'm not sure exactly why this is done at this point, since chunking is later re-enabled either by the server and/or client upon response/request delivery.
[dwarci] do-with-acceptor-request-count-incremented: a. atomically increments requests-in-progress; b. calls a function received as a parameter; c. decrements requests-in-progress; d. if acceptor-shutdown-p is set, then it signals a wake-up through the shutdown-queue.
This is a reference-counting thing used to keep track of the number of requests in the progress of being served at any given time. It's used exclusively in conjunction with process-request.
[warci] with-acceptor-request-count-incremented: Wrapper around the previous function, which conveniently allows creating a context for executing code with said fields incremented.
[amr] acceptor-make-request: Given all request parameters (headers, "content stream", HTTP method, URI and HTTP protocol version), plus the remote and local addresses taken from the socket, instantiates a new request object.
[ds] detach-socket: This isn't called from, nor subclassed, anywhere. Judging by the fact that it sets *finish-processing-socket* (to true) and *close-hunchentoot-stream* (to nil), it's apparent that it signals to the (calling) process-connection that it should stop doing any further processing on the socket of the current connection and "abandon" it, i.e. hand over control over it to the user.
If there is an honest use case for this, I don't know it yet.
[pc] process-connection: This is a fucking huge beast, moreso that it calls most of the methods and functions defined above. Let's look at it.
There's also an ":around"-qualified method defined at line 363, which does nothing but wrap a call-next-method call in a context that ensures that all errors and warnings are logged by the server. Thus all other method implementations, including the one at 427 are run in this error handling context.
As for the main implementation, it: a. takes the current acceptor socket stream; b. initializes it; then c. runs a request processing loop (which we'll get to immediately); and d. if *close-hunchentoot-stream* is set, then it flushes and closes the socket stream under processing.
As for the processing loop, it: 1. checks whether it needs to shut down, in which case it ends; 2. calls get-request-data3, returning headers, URL, method and protocol version; 3. checks that there is a valid method; 4. does some gnarly handling for "transfer encodings", i.e. chunking, then 5. with-acceptor-request-count-incremented it calls process-request on a new request generated using acceptor-make-request; 6. it flushes the output stream; 7. resets it; and then finally, 8. if *finish-processing-socket* is set, it returns.
If you've been following closely, you'll see that the thing only looks horrible and that it otherwise makes sense for the most part, plus all the complexity involved in chunking, tracking request counts and the "socket abandoning" described above. All in all, a good candidate for refactoring, altough to be honest, if you wake me up tonight and ask me to do that, I won't know where to start.
[asp2] acceptor-ssl-p: This one's well on the way out.
[ala] acceptor-log-access: logs page accesses. Called upon sending a response to the client.
[alm] acceptor-log-message: Similar to the previous function, only it logs to message-log-destination. If you're familiar with Apache, then these are very similar to the AccessLog and ErrorLog facilities. Called from the taskmaster and from the following wrapper function.
[lms] log-message*: Convenient wrapper for the previous method, that logs for the currently bound *acceptor* special variable. This is the message logger used all around the Hunchentoot code.
[sl] start-listening: When the given acceptor isn't listening, create a new listening socket; call usocket's socket-listen, which does all the binding etc.
An ":after" method is also implemented, which sets the acceptor's port if it was set to zero, i.e. it was allocated by the operating system and at that point it's setting its actual value.
[ac] accept-connections: In a loop: a. if acceptor-shutdown-p is set, then return; otherwise b. wait-for-input, then c. socket-accept on the listening socket; d. set-timeouts for new socket; e. schedule connection handling for the new socket via taskmaster's handle-incoming-connection.
accept-connections is called by the taskmaster on execute-acceptor.
[adp] acceptor-dispatch-request: As previously discussed: a. if the document-root exists, then a static file is served; otherwise, b. a HTTP not-found is returned.
In particular, (a) calls handle-static-file; let's look at this one in more detail: it 1. checks that the file exists (using FAD) and replies with a HTTP not-found if it doesn't find it; 2. sets the content-type using mime-type4; 3. checks whether the file has been modified since some particular time specified in the request; 4. attempts to handle the range header, if it exists; 5. sends the headers; 6. in a loop, sends chunks of data while there are bytes available to send; and 7. flushes the output stream before exiting.
In particular, (b) calls abort-request-handler, which transfers control to handler-done, i.e. abruptly ends request handling and sends the reply down the output stream.
[hr] handle-request: called from within the "request processing" context; it wraps acceptor-dispatch-request in a condition catcher that recovers from any errors that might occur in the process of handling requests.
[asm] acceptor-status-message: Used by Hunchentoot to "compose" "status replies", e.g. 40x pages, served from e.g. process-request. This one's another beast, so I'm breaking it into as-small-as-possible pieces to ingurgitate. There's more than one definition for this method: the main one (736), a "default" one (721) and an ":around"-qualified one (724) -- let's take them one by one.
The ":around" method runs the first thing when acceptor-status-message is called; it calls the next method, wrapped in an error handling context that logs the error and falls back to make-cooked-message.
The "default" method is the one that gets called when all other calls fail -- notice the "(acceptor t)" specialization, t being the mother of all types. Similarly to the error handling case above, this one calls make-cooked-message.
The main method: a. defines some local procedures (which we'll examine below); b. ignores status codes smaller than 3005; and c. it calls one of the local procedures, error-contents-from-template, which tests whether a file "xxx.html" (where xxx is the status code) exists6 in the error-template-directory, and if the case, it serves it under a "request context variable" substitution. In order to serve the file, its contents are read using the file-contents local procedure and then the so-called variable substitution is performed using the substitute-request-context-variables local procedure.
file-contents reads the entire contents of the file in a buffer using read-sequence.
substitute-request-context-variables looks for variables between curly braces preceded by a dollar, such as, say, ${script-name} and then substitutes them according to the contents of "properties". Not sure it's obvious by now, but I believe this ad-hoc "context variable substitution" to be a terrible idea. I genesized cl-who for precisely this type of thing and I'm not going to have it replaced by some half-assed pseudo-solution concocted by Dorel in one of his sleepless nights7.
Anyway, from the acceptor-status-message series, we're left with:
[mcm] make-cooked-message: Given a HTTP status code and optional Lisp error and backtrace objects, it formats an Apache-style error page containing a short description of the status code. The bulk of this formatting occurs in the cooked-message local procedure, while the function body matches the HTTP code to the description message and calls cooked-message.
[sak] string-as-keyword: Used by acceptor-status-message to convert a string (denoting the name of something) into a symbol from the keyword package.
[ars] acceptor-remove-session: Used downstream for session cleanup, in remove-session and, I suspect, in session management code written by the user. At this level, the implementation doesn't do anything so it must exist mainly to be subclassed by users.
[asn] acceptor-server-name: "Returns a string which can be used for 'Server' headers." Clear enough from where I'm standing.
This is it for now. I only have three other fundamental Hunchentoot pieces to dissect in this manner, plus, as can be observed here, some of the auxiliary code.
Meta: given that proper in-blog comments are for now still missing, readers are invited to leave them in #spyked or (assuming they know where they're heading) #trilema. Existing comments are preserved here in 8.
Why would I ever want to, of all things, have the server wait for arbitrary clients to get served while, of all things, shutting down? The way I see it, when the operator sends the "stop" command to the server, the latter should get to it without any delay, regardless of what happens to clients' connections or requests.
What does the reader think about this?↩
"Chunking" is a... let's say feature of HTTP/1.1 which allows dividing a request and/or a response into so-called "chunks". Yes, you read that correctly.
So the first set of fuckers building the Internet came with this notion of "Maximum Transmissible Unit" (MTU), where a "unit" is a layer 3, i.e. IP packet; and so at the third (Internet) layer, packets are given lengths that must fit into this MTU, otherwise routers will fragment them into smaller pieces or drop them or whatever the fuck it is they do. Yes, a large subset of the IP spec deals with precisely this fragmentation thing.
Now, as if this wasn't enough, TCP also has a (transport layer) segment size, which must fit into a so-called "Maximum Segment Size" (MSS), which must be smaller than the MTU, because we also need to fit lower-layer headers and all that. Otherwise TCP isn't concerned too much with this, but misconfiguration can cause problems with congestion windows and whatnot, and we sure as hell don't want this shit to blow up.
Finally, as if the fuckers who designed the L3 shit and the ones who specced the L4 shit didn't add enough, here come the L7 idiots who, not being satisfied with a "file transfer protocol" decide to "support" file transfers over HTTP; and since files may be as large as, say, 1TB, then yes, splitting them into small chunks is very much preferable to sending the whole thing right away.
Which brings us to the following question: why would they want to add support for unknown-sized data to HTTP, of all things? What's wrong with defining a FToWP (File Transfer over Web Protocol) and using that for linking to divx porn or whatever. Yeah, it's I who's hatin', not they who decide to implement transfer for virtually everything within a protocol originally designed for delivering hypertext... right? Right.↩
In short, get-request-data parses the first HTTP request line, then calls chunga's read-http-headers, does some special handling on said headers and finally, it returns all the values.↩
mime-type implements the MIME crap by holding a hash table of associations between "file types", i.e. filename extensions, and MIME content types.
This approach is notoriously idiotic, as the extension of a file's name, that is, that thing that comes after the period -- assuming a period even exists and assuming one handles all the other idiocies such as "a period at the beginning of a filename denotes a hidden file, and not an extension marker" or "what's the difference between 'somefile' and 'somefile.'?"; and yes, Common Lisp does a pretty good job at handling all these, because e.g.
""
is different fromnil
-- has nothing to do with the content of said file. The fact that Microshits imagine this and then decide to force "nudepics.jpg.exe" down the throats of idiots has zilch to do with what we're trying to achieve here and this MIME thing is from this point of view identical, since a web server can lie all it wants about the nature of the content it's serving through the mere repopulation of this hash table.From where I'm looking, this then calls for a study into how browsers act upon the Content-Type header, which would I suppose inform some sane policy to allow clients to distinguish between, I suppose, binary, plain text and HTML data. Really, it's not my job to tell the client whether the "binary" he's getting is an image or an excel spreadsheet, might as well let 'im figure this out by 'imself already.↩
I don't know why it does that, or why acceptor-status-message would ever get executed for 1xx or 2xx codes. This will remain an exercise for later.↩
This time the test is performed using probe-file instead of the FAD code. Not much consistency there...↩
For the sake of posterity, and of preserving whatever heathen history I can, I'll leave here the culprit, one self-proclaimed "hacker" Hans Hübner, who's been patching this Hunchentoot for over a decade now. Well, no more.↩
Comments so far:
Comment #1: Asciilifeform writes:
asciilifeform: spyked: can pleeeez haz comments on your www ? ( at the risk of repeating mp_en_viaje's , srsly, 'no comment box' is quite discouraging in re reading)
a111: Logged on 2019-07-26 18:53 mp_en_viaje: im guessing for this one time http://p.bvulpes.com/pastes/fn1VH/?raw=true will have to be it ; but for the love of christ, what are you doing to me here ? am i going to simply ignore your articles because i know for a fact i'm not interestreded in reading something i can't comment on and you're forcing on me the dilemma of either not commenting at all or else losing it in pastes ? this won't do, if i use an hour to read a post ispyked: asciilifeform, sorry for the annoyance. will certainly do, but the effort to add comments is non-trivial. I'ma try to add structured commenting (box + comments section for each post) in a few weeks from now, but it'll still be a manually operated thing on my side for a while.
Comment #2: Asciilifeform writes:
asciilifeform: spyked: imho your 'hunchentoot' vivisection illustrates important point : just how much of the complexity of that thing is on acct of idjit tcpism's shit abstractions, i.e. the lengths to which it goes to pretend that the machines aint exchanging short packets in quasi-reliable ordering
asciilifeform: 'acceptors', 'persistent connections', various streamisms, 'listeners', and other 'i can't believe it's not serialport!111' horros
asciilifeform: *horrors
asciilifeform: for yrs nao, asciilifeform thought, 'why the everliving fuck not serve page as a set of luby packets'
asciilifeform: then -- 'magically' -- no moar 'acceptors', 'streams', 'listeners', 'chunkings'...asciilifeform: on top of this : could just as easily serve a page from cluster of boxes instead of merely 1 (there's nothing preventing the slices from being generated wherever you want)
asciilifeform: tcp was a 'gift' of profound retardation that 'keeps on giving', even to moar obvious extent than e.g. unix. it is single-handedly responsible for ~100% of the backbreaking complexicrud of apache, ssh, ftp, etc
asciilifeform: the net aint a serial port!! it dun behave like a serial port! and the spackle dun do any good, it cracks and peels when you so much as blow on it
a111: Logged on 2018-10-25 19:10 asciilifeform: when you add compatibility spackle, serious reader is not saved from reading the thing you spackled over -- on the contrary nao he has to read the original rubbish plus your spackle, however much it weighs.
a111: Logged on 2019-02-12 23:49 asciilifeform: mircea_popescu: thinking about it -- 'zxc' strikes me as a classic case of http://btcbase.org/log/2018-10-25#1866191 and impedence mismatch generally. it was clearly written as attempt to 'deterministic scheduler on ??? iron/os', but fails, cuz you can't actually spackle away impedence mismatch b/w the underlying platform and the proggy
aciilifeform: ever wonder why heathens still fascinated, like chukchas with radio found in taiga, with 'bittorrent' ? it's because warez goes at ~line rate~ over 'bittorrent'. and at maybe 2/3 line rate on http on a good weather day. why? cuz bt , despite authored by idiot, ~let go of tcpism~ !And so on.
Comment #3: Mircea Popescu writes:
mp_en_viaje: in other scandal, by the time i hit "[wafs] wake-acceptor-for-shutdown" in spyked 's story i'm so fucking pissed off i can't even continue reading.
mp_en_viaje: why the fuck is this hunchenback even written in lisp ? it's pure python, why not just write it in python and be happy.
asciilifeform: mp_en_viaje: i was only able to eat w/out choking because already, decade+ ago, had read 'hunchentoot' an' barfed
mp_en_viaje: this is not a lisp program.And so on. The problem being that:
phf: mp_en_viaje: it's the shitsoup ecosystem, there's a handful of packages smartly written, like hunchentoot or cl-http, and then there's new wave of cffi-everything approaches, that stand tall and pretend to be people
In other words, this is pretty much the best -- since CL-HTTP is nowhere to be found on the web, or anywhere else -- that Common Lisp has to offer in terms of web server software. And yes, I'm going to, as the man said:
mp_en_viaje: but the stance seems tenable that if we're gonna go to all the trouble to make and maintain lisp webserver, might as well make a non-fucked one and make new browsers for it
mp_en_viaje: and if not wanna actually fuck, then get out of the night club, go home, have a tea
mp_en_viaje: which rapidly collapses "lisp webserver" into gossipd and all that.asciilifeform: strictly speaking , udpistic page server doesn't even require gossipism. but ideally yes.
asciilifeform: ( why would ~not~ want ? conceivably, cuz rsa is expensive. at least until we have e.g. that 8192bit mips iron, or similar )mp_en_viaje: can start with a non-cyphered poc antyway
asciilifeform: an' if 'sane server', and 'fuck or go home and tea', may as well then also serve up sexpr instead of the html soup.
mp_en_viaje: yes.
mp_en_viaje: why the fuck would lisp webserver serve ANYTHING ELSEbut until then, what am I gonna do, implement The Tar Pit comments in PHP? So no, I'm going to genesize this as the best CL-WWW server currently in existence, so that the shame of Hans Hübner et al. remains public, and then I'm going to throw it all away as soon as TMSR-WWW comes to life.↩