Sunday, August 9, 2020

Tripping over the potholes in too many libraries

While on my most recent break from writing, I pondered a bunch of things that keep coming up in issues of reliability or maintainability in software. At least one of them is probably not going to make me many friends, based on the reactions I've had to the concept in its larval form. Still, I think it needs to be explored.

In short, I think it's become entirely too easy for people using certain programming languages to use libraries from the wide world of clowns that is the Internet. Their ecosystems make it very, very easy to become reliant on this stuff. Trouble is, those libraries are frequently shit. If something in one of them is broken, you might not be able to code around it, and may have to actually deal with the maintainers to get it fixed.

Repeat 100 times, and now you have a real problem brewing.

Some of this came from a recent chat with a friend. They asked why I thought other people liked a particular language. I said that a bunch of languages have giant online collections of libraries, and they make it very easy to import them - that is, depend on them. I'm talking about stuff that's WAY easier than anything CPAN (for Perl) ever offered. I'm talking about newer things like Go, Rust, and yeah, node.js too. People really dig this stuff! I added that Python isn't quite as easy, but there are tools which try to help you out in this regard.

Obviously, the choice of language isn't particularly important here. You can get yourself into this situation with just about anything. It's a mindset thing: are you willing to offload that job to a library?

So they asked what the alternative might be. I said, well, first of all, not having a buttload of libraries that you could import at the drop of a hat might be a start. Some languages let you just point at a GitHub URL or whatever (some even simpler than that), and that's it. They said it would "result in a lot more work to accomplish just about anything people use these languages for". I agreed with this, saying that people would find themselves having to write a lot more stuff.

My friend didn't think that sounded fun. I countered that the situation of being able to import (depend on) anything trivially did not sound fun to me. Clearly, we did not agree on this matter. We got to talking about some post about "glue languages" that had made the usual HN rounds not too long ago, and they figured most programmers "fall into the glue category". They thought software would be much worse if there weren't popular, easy-to-import libraries that solved problems.

My guess was that people get into these situations where it seems like a library is going to be a solid "100% solution", and yet it lets you down and maybe reaches the 80% mark. There are best practices missing, obvious design flaws, bugs, security holes, or whatever else you can imagine. You reach for software on the virtual shelf hoping it's solid, but frequently it is not. (If it were, I wouldn't be here complaining about it.)

I have a story about a time when someone's choice to offload as much of their tool as possible to external libraries caused pain for their users. It got on my radar when I tried to use their tool and hit the same snag they did.

There was this tool that you had to run to talk to Kubernetes stuff. Because of some wacky decisions in the infrastructure, you really couldn't run the actual native CLI tools yourself. You had to go through this wrapper.

I was building a proof-of-concept tool that would connect to a few hundred of your job's instances and would effectively "tail -f" their stderr and aggregate it in a useful fashion right there on your local machine. It was intended to show people that log handling isn't magic, and you can get value from some relatively simple tooling, but that's not the story here. The story is the wrapper I had to call, and what happened with it.

While working on this, I got their wrapper tool into this failure mode where it would complain about some corrupted config file, and that would be it. It would "latch" like this eventually, and then all invocations would fail. It seemed like running it in my massively parallel manner made it that much more likely to happen.

Doing my own homework and digging around on the company chat and bug systems for other reports of this happening turned up something not too promising: the responsible team told the affected user to delete some dotfile in their home directory. That was the whole response. Not good.

I could do that too, but the problem is, it would eventually happen again. The more instances I ran in parallel, the faster it would latch in the bad state, and then I'd have to shut the whole thing down, drop the file, and start over.

I was getting tired of this, and decided to dig in. That's where this story comes back around to the "80% library" problem.

It turned out the dotfile was used by this program to remember the last time it had run, and the last time it had yelled at you about being old. The team had decided that this binary, which you manually dropped onto your Mac every so often, needed to complain when it was sufficiently old. But then they made it so it wouldn't complain EVERY time. It kept a counter and would maybe yell every tenth time, or something like that.

And so, it had a file that logged this stuff. Trouble is, that file was getting corrupted. When this happened, it failed to read it back in, and since that was treated as a fatal error, the program would not continue. There was no --ignore-that-damn-thing flag to keep going.

Looking at that file showed something wacky: it looked like it had a full set of "var = value" lines, but then the tail end of an earlier version of the file appeared again after them.

That is, it might look like "This is a config file. g file." It's almost like a longer version was written to the file, and then a shorter version was written on top of that, but the file wasn't truncated afterward. Weird, right?

Of course, anyone who's been down this road is hopping up and down yelling at their screen right now going "THEY DIDN'T USE LOCKING!" or "THEY DIDN'T USE A TEMP FILE AND RENAME" or something like it. And yeah, they're right. This thing totally did neither of those things.

When you ran it, it just opened the file and did a write. If you ran a bunch of copies in parallel, they'd all stomp all over each other, and unsurprisingly, the result was sometimes a config file that was not entirely parseable.
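
To make that concrete, here's a tiny repro of the pattern, with a made-up file name and contents (this is not the actual tool's code, just the same I/O mistake): open without O_TRUNC, write a shorter version over a longer one, and the stale tail survives.

    /* Made-up repro of the corruption: nobody truncates, so the old
     * tail outlives the new, shorter contents. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        const char *longer  = "This is a much longer config file.\n";
        const char *shorter = "This is a config file.";

        /* Writer A puts down the longer version. Note: no O_TRUNC. */
        int fd = open("demo.conf", O_WRONLY | O_CREAT, 0644);
        write(fd, longer, strlen(longer));
        close(fd);

        /* Writer B races in and writes a shorter version over the top.
         * Nothing ever truncates the file, so the old bytes past the
         * new end survive. */
        fd = open("demo.conf", O_WRONLY | O_CREAT, 0644);
        write(fd, shorter, strlen(shorter));
        close(fd);

        /* demo.conf now reads "This is a config file.config file.\n":
         * a full new copy, then the tail end of the old one again. */
        return 0;
    }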

It could have used flock() or something like that. It didn't.

It could have written the result to a path from a mktemp()-type function and then used rename() to atomically drop it into place. It didn't.
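
Neither precaution is much code, for what it's worth. Here's a rough sketch of what the two might look like together; the paths are made up, and it uses mkstemp() rather than the name-race-prone mktemp():

    /* Sketch of both defenses: an advisory flock() to serialize
     * writers, plus write-to-temp-then-rename() so nobody can ever
     * observe a half-written file. Illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/file.h>
    #include <unistd.h>

    static int save_config(const char *path, const char *data) {
        char lockpath[4096], tmp[4096];
        snprintf(lockpath, sizeof lockpath, "%s.lock", path);
        snprintf(tmp, sizeof tmp, "%s.XXXXXX", path);

        /* Take an exclusive advisory lock so concurrent writers queue
         * up instead of interleaving. */
        int lockfd = open(lockpath, O_RDWR | O_CREAT, 0644);
        if (lockfd < 0 || flock(lockfd, LOCK_EX) < 0) {
            if (lockfd >= 0) close(lockfd);
            return -1;
        }

        /* Write the new contents to a temp file in the same directory
         * (rename() is only atomic within one filesystem). */
        int fd = mkstemp(tmp);
        if (fd < 0) { close(lockfd); return -1; }
        write(fd, data, strlen(data));  /* short-write caveat: see below */
        fsync(fd);
        close(fd);

        /* Atomically drop it into place: readers see the old file or
         * the new one, never a torn mix, even if we crash mid-update. */
        int rc = rename(tmp, path);
        if (rc != 0) unlink(tmp);
        flock(lockfd, LOCK_UN);
        close(lockfd);
        return rc;
    }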

Expecting that, I got a copy of their source and went looking for the spot which was missing the file-writing paranoia stuff. I couldn't find it. All I found was some reference to this library that did config file reading and writing, and a couple of calls into it. The actual file I/O was hidden away in that other library which lived somewhere on the Internet.

Sure enough, that code had no way to do sensible locking or atomic writes. Worse still, there was no chance of handing it a "sane" file descriptor, or tricking it into writing to a safe temp path that I could then rename into place in the company's wrapper tool.

The only way to fix it would be in this third-party library. That would mean either forking it and maintaining it from there, or working with the upstream and hoping they'd take me seriously and accept it.

I've already stated that I've had bad times with this stuff, and so I tend to not engage with such projects.

I decided I had already done plenty as it was, and wasn't going to clean up their mess. This team chose to use this library, so they can figure out how they're going to deal with the problem and get the fix pushed upstream, or whatever.

I opened an internal company bug report with the team: tool X corrupts file Y when it races with other instances of itself, and then won't run. I linked to places where other people had run into it to show that it wasn't just me doing something pathological (lest they try to discount my report).

Then I waited to see what would happen.

They responded. What did they do? They made it catch the situation where the dotfile read failed, and made it not blow up the whole program. Instead, it just carried on as if the file wasn't there.

There are just so many missed opportunities here. In a different environment, it would be an opportunity to teach people about locking, atomic writes, the fact that write() can return before consuming the whole buffer, and all of those other fun Unixy things you learn the hard way. Then we could have written something that did sensible writes and used it across *ALL* of the code at that company.
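
For the curious, the short-write part of that lesson fits in about a dozen lines. This is a hypothetical write_all() helper, the sort of thing a company-wide "sensible writes" routine could be built on, and exactly what the bare write() calls in the sketches above gloss over:

    /* write() may consume fewer bytes than asked, so loop until the
     * buffer is done, retrying if a signal interrupts the call. */
    #include <errno.h>
    #include <unistd.h>

    static int write_all(int fd, const char *buf, size_t len) {
        while (len > 0) {
            ssize_t n = write(fd, buf, len);
            if (n < 0) {
                if (errno == EINTR)
                    continue;    /* interrupted by a signal: try again */
                return -1;       /* real error: let the caller decide */
            }
            buf += n;            /* partial write: advance and go again */
            len -= (size_t)n;
        }
        return 0;
    }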

But, since they had abdicated that responsibility, they were at the mercy of some project that had no particular reason to care about them. I will never know why the team chose to handle my report by swallowing the error instead of dealing with upstream, but that's what happened.

Now repeat this pattern a million times, and you have the state of the world today: a bunch of dumb paper cuts that never really go away.

There's another problem with the way people respond to situations like this. I told the above story to someone who knew my career history of working at certain companies with Lots of Actual Linux Boxes. What they said was really disheartening:

"Many non-trivial libraries would contain all sorts of issues if they were applied against Google-scale problems."

Yeah, that's right: because I worked for G or FB or whatever, somehow any time I have a problem with something, it's because I'm trying to do it at too big a scale? Are you shitting me? COME ON.

I said that they were trying to casually dismiss the sort of things that I consider table stakes, and that right there is my problem with the situation. This didn't quite land the first time, so I tried another approach.

I show up with a problem ("hey, this thing keeps getting corrupted because X and Y") and suddenly it's because I'm "from" G or FB or something and I "want unreasonable things" from their stuff. So, my request is invalid. Thank you, drive through.

That is what I mean by 80%. I live in that other 20% because I need things to work more than that. It's not even that unusual, because, remember, my chat log and bug/ticket searches had turned up other people reporting this same problem. That's how I learned the "wisdom" of "just delete the dotfile".

It seems to boil down to this: people rely on libraries. They turn out to be mostly crap. The more of them you introduce, the more likely it is that you will get something really bad in there. So, it seems like the rational approach would be to be very selective about these things, and not grab too many, if any at all.

But, if you work backwards, you can see that making it very easy to add some random library means that it's much more likely that someone will. Think of it as an "attractive nuisance". That turns the crank and the next thing you know, you have breathtaking dependency trees chock-full of dumb little foibles and lacking best practices.

Now we have this conundrum. That one library lowered the barrier to entry for someone to write that tool. True. Can't deny that.

It let someone ship something that sometimes works. Also true.

But, it gave them a false sense of completion and safety, when it is neither done nor safe. The tool will fail eventually given enough use, and (at least until they added the "ignore the failed read" thing), will latch itself into a broken state and won't ever work again without manual intervention.

Ask yourself: is that really a good thing? Do you WANT people being able to ship code like that without understanding the finer points of what's going on? Yes, obviously there's a point to be made that the systems underneath should not be so damned complicated, and having to worry about atomic writes and locking is annoying as hell, but it's what exists. If you're going to use the filesystem directly, you HAVE to solve for it. It's part of the baggage which comes with the world of POSIX-ish filesystems.

This whole thing goes into even darker places, but I think I'll stop there for now. Needless to say, I have more to write on this larger topic in the future.