The Trouble with PHP
This article was originally written for the Publ blog. I have reproduced a slightly modified version here so that it hopefully finds a wider audience.
Whenever I build a piece of software for the web, almost invariably somebody asks why I’m not using PHP to do it. While much has been written on this subject from a standpoint of what’s wrong with the language (and with which I agree quite a lot!), that isn’t, to me, the core of the problem with PHP on the web.
So, I want to talk a bit about some of the more fundamental issues with PHP, which actually go back well before PHP even existed and are inextricably linked with the way PHP applications themselves are installed and run.
(I will be glossing over a lot of details here.)
Some history
Back when the web was first created, it was all based around serving up static files. You’d have an HTML file (usually served up from a public_html directory inside your user account on some server you had access to, which was sometimes named or aliased www but more often was just some random machine living on your university’s network), and it acted much like a simplified version of FTP: someone would go to a URL like http://example.com/~username/ and you’d see an ugly directory index of the files in there (if you didn’t override it with an index.html or, more often in those days, index.htm), and then someone would click on the page they wanted to look at like homepage3.html and it would retrieve this file and whatever flaming skull .gif files it linked to in an <img> tag and the copy of canyon.mid you put an <embed> around, and that would be that. The web server was really just a file server that happened to speak HTTP.
Then one day, servers started supporting things called SSIs, short for “server-side includes.” This let you do some very simple templatization of your site; the server wouldn’t just serve up the HTML file directly, but it would scan it for simple SSI tags that told the server to replace this tag with another file, so that you could, for example, have a single navigation header that was shared between all your pages, and a common footer or whatever.
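To make the idea concrete, here is a minimal sketch of what an include pass might look like. It assumes the common `<!--#include file="..." -->` directive form, and it uses a dictionary called `fragments` (a hypothetical stand-in for the file system) so that it stays self-contained; a real server did considerably more than this.

```python
import re

# Matches directives of the form <!--#include file="name" -->
INCLUDE_RE = re.compile(r'<!--#include\s+file="([^"]+)"\s*-->')


def expand_includes(page, fragments):
    """Replace each include directive with the text of the named fragment.

    `fragments` maps file names to their contents. It is an assumption of
    this sketch, standing in for files on disk.
    """
    def replace(match):
        name = match.group(1)
        # Unknown fragments become a visible error marker instead of raising.
        return fragments.get(
            name, "[an error occurred while processing this directive]"
        )

    return INCLUDE_RE.sub(replace, page)


fragments = {
    "header.html": "<nav>Home | About</nav>",
    "footer.html": "<p>Bye</p>",
}
page = (
    '<!--#include file="header.html" -->\n'
    "<h1>Hi</h1>\n"
    '<!--#include file="footer.html" -->'
)
html = expand_includes(page, fragments)
```

The important property is that the page author only gets to name a file to paste in; nothing in the page is run as a program.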
But this mechanism was still pretty limited, and so about two minutes later someone came up with the idea of the Common Gateway Interface, or CGI; this would make it so the server would see a special URL like /cgi-bin/formail.pl and instead of serving up the content of the file, it would run the file as a separate program and serve up its output.
At this time, HTTP generally used just a single verb, GET, which would get a resource. CGI needed a way of passing in parameters to the program. Instead of just running the program like a command line (which would be very insecure), they passed in parameters through environment variables; for example, if the user requested the “file” at /cgi-bin/formail.pl?email=fwiffo@example.com&text=Hi+I+like+your+site!, the web server would set the environment variable QUERY_STRING to the value of everything after the ?, which formail.pl would then parse out.
If the POST verb were used instead, then the server would also read some additional data from the user’s web browser and then send that to the script via its standard input.
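As a rough illustration of that contract, here is a sketch of what a CGI program does with its inputs. The function name `run_cgi` is made up for this example, and it takes the environment and standard input as arguments (rather than reading `os.environ` and `sys.stdin` directly, as a real script would) so that it can be exercised without a web server.

```python
import io
from urllib.parse import parse_qs


def run_cgi(environ, stdin):
    """Build a CGI-style response from the environment and standard input."""
    # GET parameters arrive in QUERY_STRING (everything after the "?").
    params = parse_qs(environ.get("QUERY_STRING", ""))

    # For POST, the request body arrives on standard input, and
    # CONTENT_LENGTH says how much of it to read.
    if environ.get("REQUEST_METHOD") == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        for key, values in parse_qs(stdin.read(length)).items():
            params.setdefault(key, []).extend(values)

    body = "\n".join(f"{key}={values[0]}" for key, values in sorted(params.items()))
    # A CGI program writes its headers, a blank line, and then the body.
    return "Content-Type: text/plain\r\n\r\n" + body


get_env = {
    "REQUEST_METHOD": "GET",
    "QUERY_STRING": "email=fwiffo@example.com&text=Hi+I+like+your+site!",
}
response = run_cgi(get_env, io.StringIO(""))
```

Note that the program never sees a command line built from user input; everything arrives as data it has to parse for itself.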
Basically, the web server was no longer just a file server, but a primitive command processor.
Early security
Back when this first started, system administrators knew better than to let just anyone run just any program from the web server. After all, people might do silly things like make it very easy to execute arbitrary commands on the server — and since the web server often ran as the root/administrator user, this would be very bad indeed. Even the admins who were savvy enough to set up a special sandbox user for the HTTP server would still need it to run everything from a common, trusted account that might have had access to common areas of the server.
So, the usual approach was to have just a single /cgi-bin/ directory with trusted programs that were vetted and installed by the administrator, for things that they felt were important or useful for everyone to have. Usually this would be things like standard guest books (the great-great-grandfather to comment sections) or email contact forms (since spam was starting to become a problem and it was already dangerous to put your email address on the public web).
Back in these days people generally didn’t have a database (after all, Oracle was expensive) and it didn’t really matter anyway; if you wanted to have a complex website you’d just run some sort of static site generator (which was often written in tcsh or Perl or something) and if you needed scheduled posts you’d do it by having a cron job periodically update things. So, it wasn’t really that much of an impediment to have this setup.
If you were really savvy and wanted to run, say, an interactive online multiplayer game of your own design, you’d simply run your own server (often under your desk in your dorm room) and you’d have root access and could install everything you wanted in /cgi-bin/.
Because everything in /cgi-bin/ was run as a program, you knew better than to let your scripts save other files into that same directory; if it was a thing where people could upload files or post comments, it’s not like it would do any good anyway (since then the server would try to run them as programs, and you can’t run a .jpg).
Shared hosting
Then as the web really started to take off, shared hosting providers started appearing, and CGI access became a pretty commonly-requested high-end feature. Generally the shared hosting providers didn’t want to let just anyone upload a script to be run by the server, but they also didn’t want to have to manually vet each and every script that users wanted to install. So, as a compromise, they set up special rules so that within your own server space you could have a /cgi-bin/ directory and that things run from that directory would run under your account, rather than as the web server (using a mechanism called suexec).
This provided a pretty good compromise; users still had to know what they were doing in order to install their scripts, but they still ran from a little sandboxed location, and because of the way suexec worked it was pretty unlikely for even a very badly-written script to cause problems, because if the script tried to save out an executable file into the cgi-bin directory, it wouldn’t be saved out with execute privileges, so it would just cause an error 500 to occur. After all, /cgi-bin/picture.jpg wasn’t a program, so why should it run?
Increased flexibility
But then things started to get a little more complicated. People wanted their main index page to be able to run as a script, without it forwarding the page to /cgi-bin/index.pl or whatever.
So, another compromise happened: the CGI mechanism, which previously was set up to only run the scripts from the /cgi-bin/ directory, got a few new rules, such as “if the filename ends in .cgi (or other common extensions like .pl or .py) run it as a script.” It still needed permissions to be set correctly, though, and by this point suexec was generally set up so that there were even more rigorous checks before it would run the script. And there were so many safety checks in place that this was still generally okay.
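The kind of gatekeeping described here can be sketched roughly as follows. This is a simplification written for illustration, not the actual list of checks that suexec performs; the function name `should_run_as_script` and the `SCRIPT_EXTENSIONS` set are assumptions of the example.

```python
import os
import stat

# Extensions the (hypothetical) server treats as scripts outside cgi-bin.
SCRIPT_EXTENSIONS = {".cgi", ".pl", ".py"}


def should_run_as_script(path, expected_uid):
    """Return True only if the file passes every check in this sketch."""
    info = os.stat(path)

    # Rule 1: the file must live in a cgi-bin directory or look like a script.
    in_cgi_bin = "cgi-bin" in path.split(os.sep)
    has_script_extension = os.path.splitext(path)[1] in SCRIPT_EXTENSIONS
    if not (in_cgi_bin or has_script_extension):
        return False

    # Rule 2: the owner must have marked it executable.
    if not info.st_mode & stat.S_IXUSR:
        return False

    # Rule 3: it must belong to the account it would run as.
    if info.st_uid != expected_uid:
        return False

    # Rule 4: nobody else may be able to modify it.
    if info.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
        return False

    return True
```

Under rules like these, a file that a script wrote into the directory with ordinary (non-executable) permissions fails the second check and never runs.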
Around this time it also started becoming common to have access to a database such as MySQL or PostgreSQL, which allowed more flexibility and more two-way content. Forums became a thing. So did early blogs. Most of this software started out by having the database just for storage and the software would simply write out static files, but this started to have scaling problems and the web server got busy with the software writing these files out all the time, so it became more common for the software to simply read from the database directly as it ran. This helped somewhat, but it also shifted a significant amount of load over to constantly establishing short-lived database connections, because every time the forum program ran it had to connect.
Hello PHP
At some point, PHP started to get popular.
PHP itself was originally intended as another way of adding server-side scripting into HTML files; it was in effect a templating system for HTML. In the earliest days it was often just treated as another scripting language; the server would be configured to consider .php as another name for .cgi or .pl or whatever, and the file would still be run as a script. In some cases it even needed to start with #!/usr/local/bin/php and it needed to be set executable with the correct permissions and so on (although this setup was uncommon).
However, most sites used mod_php, a server extension that allowed the web server to handle PHP files directly. In many respects it was very similar to mod_cgi, except it did a few interesting things. One of the undeniable benefits was that it was now able to maintain the database connection persistently, rather than having to re-establish a connection every time a script ran. It was also generally a bit nicer for speed because commonly-used PHP scripts could stay in memory and not have to be re-interpreted every time a page was loaded.
But there were a couple of other implications this led to. In particular:
- It embedded the PHP interpreter into the web server itself (rather than running it as an external program)
- Since it was no longer shelling out to an external program, it could always run a .php file regardless of its execution permissions — and so that’s what it did
There were a few different variations on this and it didn’t always just run PHP from the web server (for example, some of the better hosts figured out that they could have each user run their own separate per-user FastCGI server that would run the PHP programs as the separate users, or whatever) but regardless of the setup, you now had PHP always running and not having to care about the permissions of the file, meaning you now had some persistent process running what was essentially executable code without the usual safeguards that a shared server would have.
This actually seemed like a good thing at the time, but then many, many pieces of software started allowing arbitrary people to upload images, and often wouldn’t make sure that what was supposedly an image was actually an image…
And so that’s where we stand today.
This makes sites potentially vulnerable even if they aren’t written in PHP themselves; for example, if your HTML directory permissions are set to be slightly too permissive, and another site on the server gets hacked, that hacked site can potentially be used to place a .php file into your site, and since mod_php doesn’t check ownership permissions it now runs on your site with whatever permissions PHP would normally run in your account. (And this isn’t just theoretical; I’ve had sites hacked in this way! Now I run a nightly script that ensures that my directory permissions are correct and tells me about new .php files that appeared since the last check, just to be sure.)
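A nightly check along those lines could look something like the sketch below. This is not the script mentioned above, just one guess at the general shape: the function name `audit_site` is made up, it only reports problems rather than fixing permissions, and it treats "new" as "modified since the timestamp of the last check".

```python
import os
import stat


def audit_site(root, last_check):
    """Return (loose_dirs, new_php) for everything under `root`.

    loose_dirs: directories that are writable by group or others.
    new_php:    .php files modified after `last_check` (a Unix timestamp).
    """
    loose_dirs, new_php = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        # A directory other accounts can write to is how foreign files get in.
        if os.stat(dirpath).st_mode & (stat.S_IWGRP | stat.S_IWOTH):
            loose_dirs.append(dirpath)
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            is_php = name.lower().endswith(".php")
            if is_php and os.stat(full_path).st_mtime > last_check:
                new_php.append(full_path)
    return sorted(loose_dirs), sorted(new_php)
```

Run from cron, the two lists could be mailed to the site owner whenever either one is non-empty.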
So, long story short, one of the biggest problems with PHP isn’t with the language itself, but with the way that PHP gets run; people (and their bots) can find ways to upload arbitrary files with a .php extension and, if that upload is visible to the web server (which it often will be), then a request to view that file will execute that file, regardless of its origin, and from there it can do anything that your own site can.
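One common defensive habit, sketched here under some loud assumptions, is to never let the uploader choose the stored file name or extension. The function name `stored_name_for_upload` and the short `SIGNATURES` table are inventions of this example, and checking leading bytes does not prove a file is a harmless image (a file can start like a GIF and still contain PHP code further in). The point is only that the saved file can never end in .php.

```python
import uuid

# Leading "magic" bytes for a few common image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": ".png",
    b"\xff\xd8\xff": ".jpg",
    b"GIF87a": ".gif",
    b"GIF89a": ".gif",
}


def stored_name_for_upload(data):
    """Return a server-chosen file name for an upload, or None to reject it.

    The client-supplied file name is deliberately not a parameter: the
    extension comes from the sniffed content and the rest is random.
    """
    for magic, extension in SIGNATURES.items():
        if data.startswith(magic):
            return uuid.uuid4().hex + extension
    return None
```

Even with a check like this in place, the upload directory still should not be somewhere the server is willing to execute files from.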
Other PHP features of note
Granted, the erroneously-executable upload feature is only responsible for some of the security exploits I’ve seen in the wild. I wasn’t really intending to get into language-specific issues (after all, I linked to much better, more comprehensive articles about it in the introduction), but it’s worth mentioning some of them anyway, as I have seen all of these be used to hack websites I’ve helped to clean up and secure.
The biggest one: For a very long time, the include() function would happily support any arbitrary URL and would download and run whatever URL it was given. And it was very easy for a PHP script to be accidentally written to allow an arbitrary user to provide such an arbitrary URL. (And by “a very long time” I mean that this was the default configuration until very recently, and many hosts still configure it that way for backwards compatibility.)
Some might be looking at the PHP docs I linked to there and thinking, “wait, but it’s not running the PHP code locally.” What the docs mean is that if you do something like
include('http://example.com/foo.php');
it’s the output of foo.php that gets included. However, that output could in turn be more PHP code, which would then be executed locally, meaning on your server. And PHP doesn’t even care what the file extension is; doing an include() on asdf.txt or pony.jpg will happily execute whatever <?php ?> blocks exist inside of it as well.
There are also a few other features of PHP that lend themselves to arbitrary code execution. One particularly fun one was the PCRE e flag, which indicated that the result of a preg_replace() substitution should be executed as PHP code; and as PCRE flags are embedded into the regular expression itself, a carefully-crafted search term (on a less-carefully-crafted search page) could run arbitrary code. Fortunately, this has been removed in PHP 7; unfortunately, a lot of web hosts still run PHP 5 (or older!) and so this option (which never had a single legitimate usage) is still available on the vast majority of web servers out there.
How application containers (Node.js, Flask, Django, etc.) are different
So, I originally wrote this article for the Publ blog, which implies that I’m trying to build a favorable comparison for Publ. And that’s a perfectly fine inference to take.
Publ is built on Flask, which runs in a self-contained server. There are a few different ways to deploy it such as WSGI (Web Server Gateway Interface) or by having a “reverse proxy” configuration or the like. This is a bit more complicated than I want to get into but the short version is that rather than the web server running a program based on the URL, Publ stays running as a standalone program that the web server sends commands to as requests come in. So, it’s never asking a file how it should be run, but instead it’s telling a single program to handle a request. So, there’s no danger of some random file being executed when it shouldn’t be.
“But wait,” you might ask, “isn’t that exactly what you were complaining about mod_php doing?” Well, sort of; mod_php works by always having the PHP interpreter running and able to execute whatever arbitrary code it comes across. However, in this setup, code is kept separate from data. Loading a URL in Flask isn’t mapping to a script file that gets loaded and run, it’s simply passing a URL to a single fixed application that handles the URL accordingly. In the case of Publ it loads a content file and formats it through a template.
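To show the shape of that arrangement without pulling in Flask itself, here is a bare WSGI application using nothing outside the Python language. It is a sketch of the general pattern rather than how Publ or Flask are implemented; the `CONTENT` dictionary is a stand-in for content files on disk, and the URLs in it are made up.

```python
# Content is plain data. A real application would read it from files;
# this dictionary is an assumption to keep the sketch self-contained.
CONTENT = {
    "/": "Welcome",
    "/about": "About this site",
}


def application(environ, start_response):
    """A single fixed WSGI application: URLs select data, never code."""
    path = environ.get("PATH_INFO", "/")
    text = CONTENT.get(path)

    if text is None:
        # Unknown URLs are simply not found. Nothing is looked up on disk
        # and executed, whatever the URL happens to end with.
        start_response("404 Not Found", [("Content-Type", "text/plain; charset=utf-8")])
        return [b"Not found"]

    # The "template" here is a fixed string owned by the application.
    body = f"<html><body><p>{text}</p></body></html>".encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body]
```

For local experimentation, something like `wsgiref.simple_server.make_server("", 8000, application).serve_forever()` from the standard library can serve it. A request for /evil.php goes through exactly the same function as every other request and just gets a 404.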
Another thing that Flask does is it separates out template content (which is executable) from static file content. Static files aren’t executable by default. Templates can embed arbitrarily-complex code, but they can only use functions that are provided to them — there’s no direct access to the entire Python standard library, for example, and so the most dangerous functions aren’t included by default. (And Publ does not provide any of those functions either, at least not purposefully.)
Important note: When I say static files aren’t executable by default, this simply refers to how Publ sees them. If your site is configured to serve up static files where PHP or CGI scripts are executable, then any such scripts that end up in your static files will indeed be executable. This is going to be the case on pretty much any shared hosting provider, for example.
Also, regardless of the server setup, Publ can’t magically protect your content or template directory from outright misconfigurations with permissions. Even classic static sites need to be secured from third-party/unauthorized access.
Publ itself also only knows how to handle a handful of content formats (Markdown, HTML, and images) and ignores everything else. So if a .php file somehow ends up in the content directory, it won’t matter at all; Publ just ignores it. It will never attempt to run code that’s embedded in a content file, nor does it even know how to. And Publ doesn’t handle arbitrary user uploads anyway (nor is there any plan to ever support this); anything that would be potentially hazardous would have been put there by some other means.
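The "ignore everything else" behavior amounts to an allowlist. The sketch below is purely illustrative and is not taken from Publ’s source; the `HANDLERS` table, its extensions, and the `classify` function are all hypothetical names.

```python
import os

# Hypothetical allowlist: known extensions map to a kind of handling.
HANDLERS = {
    ".md": "markdown",
    ".html": "html",
    ".htm": "html",
    ".jpg": "image",
    ".png": "image",
    ".gif": "image",
}


def classify(filename):
    """Return how a content file would be handled, or None to ignore it."""
    extension = os.path.splitext(filename)[1].lower()
    return HANDLERS.get(extension)
```

With a table like this, a stray shell.php classifies as None and is skipped, because nothing in the application knows what to do with it.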
Publ’s design is basically just a fancy way of presenting static files, just like in the early days of the web. It just serves up the static files dynamically. Or, as I keep on saying, Publ is like a static publishing system, only dynamic.
(Of course, if your directory permissions are set wrong, someone can still use someone else’s exploited PHP-based site to attack your account and modify Publ’s code. But there’s nothing that Flask or Publ can do to prevent that, and this is just a general security problem that impacts everyone regardless of what they’re running.)
It would of course be foolish of me to claim that Publ itself is 100% secure and impossible to hack. And at least on Dreamhost there’s the very real possibility that somehow an arbitrary .php file gets injected into the static files (perhaps by an incorrect directory permission or whatever), which isn’t a flaw in Publ itself but the end result (a hacked site) is the same. So far as I can tell there’s no way to entirely disable PHP on a Dreamhost-based Publ instance, and it’s really the ability to run PHP that makes PHP so dangerous in this world.
So, I’m not going to claim that Publ is 100% secure or unhackable. But it sure has one heck of a head start.