
Anarcat

Why make things simple when you can make them complicated?

Securing registration email

Wed, 03/20/2019 - 10:28

I've been running my own email server basically forever. Recently, I've been thinking about possible attack vectors against my personal email. There's of course a lot of private information in that mailbox, and if someone manages to compromise my email account, they will see a lot of personal information. That's somewhat worrisome, but there are possibly more serious problems to worry about.

TL;DR: if you can, create a second email address to register on websites, and use stronger protections on that account than on your regular mail.

Hacking accounts through email

Strangely, what keeps me up at night is more the kind of damage an attacker could do to other accounts I hold with that email address. Because basically every online service is backed by an email address, if someone controls my email address, they can do a password reset on every account I have online. In fact, some authentication systems have just given up on passwords altogether and use the email system itself for authentication, essentially using the "password reset" feature as the authentication mechanism.

Some services have protections against this: for example, GitHub requires a 2FA token when doing certain changes, which the attacker hopefully wouldn't have (although phishing attacks have been getting better at bypassing those protections). Other services will warn you about the password change, which might be useful, except the warning is usually sent... to the hacked email address, which doesn't help at all.

The solution: a separate mailbox

I had been using an extension (anarcat+register@example.com) to store registration mail in a separate folder for a while already. This allows me to bypass greylisting on the email address, for one. Greylisting is really annoying when you register on a service or do a password reset... The extension also allows me to sort those annoying emails into a separate folder automatically with a simple Sieve rule.
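
For the record, such a rule can be as simple as this (a sketch of my own, assuming the Sieve "subaddress" extension and a folder literally named "register"):

require ["fileinto", "subaddress"];

# file anything sent to the +register extension into the register folder
if address :detail :is "to" "register" {
    fileinto "register";
}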

More recently, I have been forced to use a completely different email alias (register@example.com) on some services that dislike having plus signs (+) in email addresses, even though they are perfectly valid. That got me thinking about the security problem again: if I have a different alias, why not make it a completely separate account and harden that against intrusion? With a separate account, I could enforce things like SSH-only access or 2FA that would be inconvenient for my main email address when I travel, because I sometimes log into webmail, for example. Because I don't frequently need access to registration mail, it seemed like a good tradeoff.

So I created a second account, with a locked password and SSH-only authentication. That way the only way someone can compromise my "registration email" is by hacking my physical machine or the server directly, not by just bruteforcing a password.

Now of course I need to figure out which sites I'm registered on with a "non-registration" email (anarcat@example.com): before I thought of using the register@ alias, I sometimes used my normal address instead. So I'll have to track those down and reset those. But it seems I already blocked a large attack surface with a very simple change and that feels quite satisfying.
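
Since all that mail is indexed anyway, one way to start that hunt is a simple notmuch query (a sketch; the exact terms to look for are guesswork on my part):

notmuch search 'to:anarcat@example.com and (from:noreply or subject:"password reset")'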

Implementation details

Using syncmaildir (SMD) to sync my email, the change was fairly simple. First, I needed to create a second SMD profile:

if [ $(hostname) = "marcos" ]; then
    exit 1
fi
SERVERNAME=smd-server-register
CLIENTNAME=$(hostname)-register
MAILBOX_LOCAL=Maildir/.register/
MAILBOX_REMOTE=Maildir
TRANSLATOR_LR="smd-translate -m move -d LR register"
TRANSLATOR_RL="smd-translate -m move -d RL register"
EXCLUDE="Maildir/.notmuch/hooks/* Maildir/.notmuch/xapian/*"

Very similar to the normal profile, except that mails get stored in the already existing Maildir/.register/ folder, and a different SSH profile and different translation rules are used. The new SSH profile is basically identical to the previous one:

# wrapper for smd
Host smd-server-register
    Hostname imap.anarc.at
    BatchMode yes
    Compression yes
    User register
    IdentitiesOnly yes
    IdentityFile ~/.ssh/id_ed25519_smd

Then we need to ignore the register folder in the normal configuration:

diff --git a/.smd/config.default b/.smd/config.default
index c42e3d0..74a8b54 100644
--- a/.smd/config.default
+++ b/.smd/config.default
@@ -59,7 +59,7 @@ TRANSLATOR_RL="smd-translate -m move -d RL default"
 # EXCLUDE_LOCAL="Mail/spam Mail/trash"
 # EXCLUDE_REMOTE="OtherMail/with%20spaces"
 #EXCLUDE="Maildir/.notmuch/hooks/* Maildir/.notmuch/xapian/*"
-EXCLUDE="Maildir/.notmuch/hooks/* Maildir/.notmuch/xapian/*"
+EXCLUDE="Maildir/.notmuch/hooks/* Maildir/.notmuch/xapian/* Maildir/.register/*"
 #EXCLUDE_LOCAL="$MAILBOX_LOCAL/.notmuch/hooks/* $MAILBOX_LOCAL/.notmuch/xapian/*"
 #EXCLUDE_REMOTE="$MAILBOX_REMOTE/.notmuch/hooks/* $MAILBOX_REMOTE/.notmuch/xapian/*"
 #EXCLUDE_REMOTE="Maildir/Koumbit Maildir/Koumbit* Maildir/Koumbit/* Maildir/Koumbit.INBOX.Archives/ Maildir/Koumbit.INBOX.Archives.2012/ Maildir/.notmuch/hooks/* Maildir/.notmuch/xapian/*"

And finally we add the new profile to the systemd services:

diff --git a/.config/systemd/user/smd-pull.service b/.config/systemd/user/smd-pull.service
index a841306..498391d 100644
--- a/.config/systemd/user/smd-pull.service
+++ b/.config/systemd/user/smd-pull.service
@@ -8,6 +8,7 @@ ConditionHost=!marcos
 Type=oneshot
 # --show-tags gives email counts
 ExecStart=/usr/bin/smd-pull --show-tags
+ExecStart=/usr/bin/smd-pull --show-tags register

 [Install]
 WantedBy=multi-user.target
diff --git a/.config/systemd/user/smd-push.service b/.config/systemd/user/smd-push.service
index 10d53c7..caa588e 100644
--- a/.config/systemd/user/smd-push.service
+++ b/.config/systemd/user/smd-push.service
@@ -8,6 +8,7 @@ ConditionHost=!marcos
 Type=oneshot
 # --show-tags gives email counts
 ExecStart=/usr/bin/smd-push --show-tags
+ExecStart=/usr/bin/smd-push --show-tags register

 [Install]
 WantedBy=multi-user.target

That's about it on the client side. On the server, the user is created with a locked password and the mailbox moved over:

adduser --disabled-password register
mv ~anarcat/Maildir/.register/ ~register/Maildir/
chown -R register:register Maildir/

The SSH authentication key is added to .ssh/authorized_keys, and the alias is reversed:

--- a/aliases
+++ b/aliases
@@ -24,7 +24,7 @@ spamtrap: anarcat
 spampd: anarcat
 junk: anarcat
 devnull: /dev/null
-register: anarcat+register
+anarcat+register: register
 # various sandboxes
 anarcat-irc: anarcat

... and the email is also added to /etc/postgrey/whitelist_recipients.
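
That last step amounts to a one-line addition to the file mentioned above (example domain as before; depending on the postgrey packaging, a .local variant of the file may be more appropriate):

echo "register@example.com" >> /etc/postgrey/whitelist_recipients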

That's it: I now have a hardened email service! Of course there are other ways to harden an email account. On-disk encryption comes to mind, but from what I understand that only works with password-based authentication, which is something I want to avoid in order to rule out bruteforce attacks.

Your advice and comments are of course very welcome, as usual.


February 2019 report: LTS, HTML mail, new phone and new job

Tue, 03/05/2019 - 21:04
Debian Long Term Support (LTS)

This is my monthly Debian LTS report.

This is my final LTS report. I have found other work and will unfortunately not be able to continue working on the LTS project in the foreseeable future. I will continue my volunteer work on Debian and might even contribute to LTS in my normal job, but not directly as part of the LTS team.

It is too bad because that team is doing essential work, and needs more help. Security is, at best, lacking everywhere, and I do not believe the current approach of "minimal viable product, move fast, then break things" is sustainable. The people working on Linux distributions, and the LTS people too, are doing the hard, dirty work of maintaining free software in the long term. It's thankless, but I believe it's one of the most important jobs out there right now. And I suspect there will only be more of it as time goes by.

Legacy systems are not going anywhere: this is the next generation's "y2k bug": old, forgotten software that no one understands or cares to work with, which suddenly breaks or turns out to have a critical vulnerability that needs patching. Moving faster will not help us fix this problem: it only piles up more crap to deal with for real systems running in production.

The survival of humans and other species on planet Earth in my view can only be guaranteed via a timely transition towards a stationary state, a world economy without growth.

-- Peter Custers

Website work

I again worked on the website this month, doing one more mass import (MR 53) which was finally merged by Holger Levsen, after I fixed an issue with PGP signatures showing up on the website.

I also polished the misnamed "audit" script that checks for missing announcements on the website and published it as MR 1 on the "cron" project of the webmaster team. It's still a "work in progress" because it is still too noisy: there are a few DLAs missing already and we haven't published the latest DLAs on the website.

The remaining work here is to automate the import of new announcements on the website (bug #859123). I've done what is hopefully the last mass import and updated the workflow in the wiki.

Finally, I have also done a bit of cleanup on the website that was necessary after the mass import which also required rewrite rules at the server level. Hopefully, I will have this fairly well wrapped up for whoever picks this up next.

Python GPG concerns

Following a new vulnerability (CVE-2019-6690) disclosed in the python-gnupg library, I have expressed concerns about the security reliability of the project in future updates, referring to wider issues identified by isis lovecroft in this post.

I suggested we should simply drop security support for the project, citing that it didn't have many reverse dependencies. But it seems that wasn't practical; the response was that it was actually possible to keep on maintaining it, and such an update was issued for jessie.

Golang concerns

Similarly, I have expressed more concerns about the maintenance of Golang packages following the disclosure of a vulnerability (CVE-2019-6486) regarding elliptic curve implementations in the core Golang libraries. An update (DLA-1664-1) was issued for the core, but because Golang is statically compiled, I was worried the update wasn't sufficient: we also needed to upload updates for any build dependency using the affected code.

Holger asked the golang team for help and I also asked on IRC. Apparently, all the non-dev packages (with some exceptions) were binNMU'd in stretch, but the process needs to be clarified.

I also wondered if this maintenance problem could be resolved in the long term by switching to dynamic linking. Ubuntu tried to switch to dynamic linking but abandoned the effort, so it seems Golang will be quite difficult to maintain for security updates in the foreseeable future.

Libarchive updates

I have reproduced the problem described in CVE-2019-1000020 and CVE-2019-1000019 in jessie. I published a fix as DLA-1668-1. I had to build the update without sbuild's overlay system (in a tar chroot) otherwise the cpio tests fail.

Netmask updates

This one was minimal: a patch was sent by the maintainer so I only wrote and sent DLA 1665-1. Interestingly, I didn't have access to the .changes file which made writing the DLA a little harder, as my workflow normally involves calling gen-DLA --save with the .changes file which autopopulates a template. I learned that .changes files are normally archived on coccia.debian.org (specifically in /srv/ftp-master.debian.org/queue/done/), but not in the case of security uploads.

Libreoffice

I once again tried to tackle an issue (CVE-2018-16858) with Libreoffice. The last time I tried to work on LibreOffice, the test suite was failing and the linker was crashing after hours of compilation and I never got anywhere. But that was wheezy, so I figured jessie might be in better shape.

I quickly got into trouble with sbuild: I ran out of space on both / and /home, so I moved all my photos to an external drive (!). The patch ended up being trivial. I could reproduce with a simple proof of concept, but could not quite get code execution going. It might just be that I haven't found the right Python module to load, so I assumed the code was vulnerable and, given the patch was simple, decided it was worth doing an update.

The build ended up taking close to nine hours and 35GiB of disk space. I published DLA-1669-1 as a result.

I also opened a bug report against dput-ng because it still doesn't warn users about uploads to security-master the same way dput does.

Enigmail

Enigmail was finally taken off the official support list in jessie when the debian-security-support proposed update was approved.

Other free software work

Since I was going to start that new job in March, I figured I would try to take some time off before work starts. I therefore mostly tried to wrap things up and didn't do as much volunteer work as I usually do. I'm also unsure I'll be able to do as much volunteer work now that I'm starting a full-time job, so this might well be my last report for a while.

Debian work before the freeze

I uploaded new versions of bitlbee-mastodon (1.4.1-1), sopel (6.6.3-1 and 6.6.3-2) and dateparser (0.7.1-1). I've also sponsored new uploads of smokeping and tuptime.

I also uploaded convertdate to NEW as it was a (missing but optional) dependency of dateparser. Unfortunately, it didn't make it through NEW in time for the freeze so dateparser won't be totally fixed in buster.

I also made two new releases of feed2exec, my programmable feed reader, to fix date parsing on broken feeds, add a JSON output plugin, and fix an issue with the ikiwiki_recentchanges plugin.

New phone

I got tired and bought a new phone. Even though I have almost a dozen old phones in a plastic box here, most of them are basically unusable:

  • two are just "feature phones" - I need OSMand
  • two are Nokia n900 phones that can't read a SIM card
  • at least two have broken screens
  • one is "declared stolen or lost" (same, right?) which means it can't be used as a phone at all, which is totally stupid if you ask me

I managed to salvage the old htc-one-s I had. It's still a little buggy (it crashes randomly) and a little slow, but generally works and I really like how small it is. It's going to be hard to go back to a bigger format.

I bought a Fairphone 2 (FP2). It was pricey, and it's crazy because they might come up with the FP3 this year, but I was sick of trying to cross-reference specification tables and LineageOS download pages. The FP2 just works with an "open" Android version (and LOS) out of the box. But more importantly, the FP project tries to avoid major human rights issues in the sourcing of components and the production of the device, something that's way too often overlooked. Many minerals involved in the fabrication of modern electronics come from conflict zones or involve horrible (child) labour conditions. Fixing those issues should be our priority, maybe even before hardware or software freedom.

Even without completely addressing those issues, the fact that it scored a perfect 10 in iFixit's repairability score is amazing. It seems parts are difficult to find, even in Europe. The phone doesn't ship to the Americas from the original website, which makes it difficult to buy, but some shops do ship to Canada, like Ecosto.

So we'll see how that goes. I will, as usual, document my experiences in the wiki, in fairphone2.

Mailing list experiments

As part of my calendar project, I figured I would keep my "readers" informed of my progress this year and send them an update every month or so. I was inspired by this post as I said last week: I can't stop thinking about it.

So I kept working on Mailman 3. Unfortunately, only a single one of my proposed patches was merged. Many of them are "work in progress" (WIP) of course, but I was hoping to get more feedback on the proposals, especially the no-notification workflow. Such a workflow delegates the sending of confirmation mails to the caller, which enables them to send more complex email than the straitjacket the templating system forces you into: you could then control every part of the email, not just the body and subject, but also content type, attachments and so on. That didn't seem to get traction: some informal comments I received said this wasn't the right fix for the invite problem, but then no one is working on fixing the invite problem either, so I wonder where that is going to go.

Unabashed, I tried to provide a French translation, which allowed me to send an actual invite, fully translated. This was a lot of work for not much benefit, so that was frustrating as well.

In the end, I just ended up with a Bcc list that I keep as an alias in my ~/.mutt/aliases, which notmuch reads thanks to my notmuch-address hack. In the email, I offered my readers an "opt-out": if they don't write back, they're on the mailing list. It's spammy, but the readers are not just the general public: they are people I know well, who are close to me, and to whom I have given a friggin' calendar (at least most of them).

If I find the energy, I'll finish setting up Mailman 3 just the way I like and use it to do the next mailing. But I can't help but think the mailing list is overkill for this now: the mailing with a Bcc list worked without a flaw, as far as I could tell, and it means minimal maintenance. So I'm not sure I'll battle Mailman 3 much longer, which is a shame because I happen to believe it's probably our best bet to keep mailing lists (and therefore probably email itself) alive in the future.

Emailing HTML in Notmuch

I actually had to write content for that email too - just messing around with the mailing list server is one thing, but the whole point is to actually say something. Or, in my case, show something, which is difficult using plain text. So I went crazy and tried to send HTML mail with notmuch. The thread is interesting: I encourage you to read it in full, but I'll quote the first post here for posterity:

I know, I know, HTML email is "evil"[1]. I mostly never ever use it, in fact, I don't remember the last time I consciously sent HTML. Maybe I did so back when I was using Netscape Communicator[2][3], but whatever.

The reason I thought about this again is I have been doing more photography these days and, well, being allergic to social media, I have very few ways of sharing those photographs with families and friends. I have tried creating a gallery website with an RSS feed but I'm sure no one here will be surprised that the uptake is minimal, if non-existent. People expect to have stuff pushed to them, like Instagram, Facebook, Twitter or Spam does.

So I thought[4] of Email again: the original social network! I figured I would just make a mailing list, and write to my people once in a while to let them know about my new pictures. And while writing the first email, I realized it was pretty silly to not include images, or at least links to images, in the email.

I'm sure you can see where this is going. A link in the email: who's going to click that. Who clicks now anyways, with all the tapping[5] going on. So the answer comes naturally: just write frigging HTML email. Don't be a rms^Wreligious zealot and do the right thing, what works basically everywhere[6] (even notmuch!).

So I started Thunderbird and thought "what the heck am I doing! there must be a better way!" After searching for "message mode emacs html email ktxbye", I found some people already thought about this problem and came up with somewhat elegant solutions[7]. I built on that by trying to come up with a pure elisp solution, which goes a little like this:

(defun anarcat/notmuch-html-convert ()
  """create an HTML part from a Markdown body
This will not work if there are *any* attachments of any form,
those should be added after."""
  (interactive)
  (save-excursion
    ;; fetch subject, it will be the HTML version title
    (message "building HTML attachment...")
    (message-goto-subject)
    (beginning-of-line)
    (search-forward ":")
    (forward-char)
    (let ((beg (point)))
      (end-of-line)
      (setq subject (buffer-substring beg (point))))
    (message "determined title is %s..." subject)
    ;; wrap signature in a <pre>
    (message-goto-signature)
    (forward-line -1)
    ;; save and delete signature which requires special formatting
    (setq signature (buffer-substring (point) (point-max)))
    (delete-region (point) (point-max))
    ;; set region to top of body then end of buffer
    (end-of-buffer)
    (message-goto-body)
    (narrow-to-region (point) (mark))
    ;; run markdown on region
    (setq output-buffer-name "*notmuch-markdown-output*")
    (message "running markdown...")
    (markdown output-buffer-name)
    (widen)
    (save-excursion
      (set-buffer output-buffer-name)
      (end-of-buffer)
      ;; add signature formatted as <pre>
      (insert "\n<pre>")
      (insert signature)
      (insert "</pre>\n")
      (markdown-add-xhtml-header-and-footer subject))
    (message "done the dirty work, re-inserting everything...")
    ;; restore signature
    (message-goto-signature)
    (insert signature)
    (message-goto-body)
    (insert "<#multipart type=alternative>\n")
    (end-of-buffer)
    (insert "<#part type=text/html>\n")
    (insert-buffer output-buffer-name)
    (end-of-buffer)
    (insert "<#/multipart>\n")
    (let ((f (buffer-size (get-buffer output-buffer-name))))
      (message "appended HTML part (%s bytes)" f))))

For those who can't read elisp for breakfast, this does the following:

  1. parse the current email body as markdown, in a separate buffer
  2. make the current email multipart/alternative
  3. add an HTML part
  4. inject the HTML version in the HTML part

There's some nasty business going on there to format the signature correctly by wrapping it in a <pre> - I took that from Thunderbird as well.

(For those who do read elisp for breakfast, improvements and comments on the coding style are very welcome.)

The idea is that you write your email normally, but in markdown. When you're done writing that email, you launch the above function (carefully bound to "M-x anarcat/notmuch-html-convert" here) which takes that email and adds an equivalent HTML part to it. You can then even tweak that part to screw around with the raw HTML if you feel depressed or nostalgic.

What do people think? Am I insane? Could this work? Does this belong in notmuch? Or maybe in the tips section? Should I seek therapy? Do you hate markdown? Expand on the relationship between your parents and text editors.

Thanks for any feedback,

A.

PS: the above, naturally, could be adapted to parse the body as RST, asciidoc, texinfo, latex or whatever insanity you think would be more appropriate, I don't care. The idea is the same.

PPS: I remember reading about someone wanting to declare a text/markdown mimetype for email, and remembering it was all backwards and weird and I can't find the reference anymore. If some lazyweb magic person could forward the link to me I would be grateful.

[1]: one of so many: https://www.georgedillon.com/web/html_email_is_evil_still.shtml
[2]: https://en.wikipedia.org/wiki/Netscape_Communicator
[3]: yes my age is showing
[4]: to be fair, this article encouraged me quite a bit: https://blog.chaddickerson.com/2019/01/09/replacing-facebook/
[5]: not the bass guitar one, unfortunately
[6]: https://en.wikipedia.org/wiki/HTML_email#Adoption
[7]: https://trey-jackson.blogspot.com/2008/01/emacs-tip-8-markdown.html

I edited the original message to include the latest version of the script, which (unfortunately) lives in my private dotfiles git repository.

In the end, all that effort didn't quite do it: the image links would break in webmail when seen from Chromium. This is apparently intended behaviour: the problem was that I am embedding the username/password of the gallery in the HTTP URL, using in-URL credentials, which is apparently "deprecated" even though no standard actually says so. So I ended up generating a full HTML version of the frigging email, complete with a link on top of the email saying "if this email doesn't display properly, click the following".

Now I remember why I dislike HTML email. Yet my readers were quite happy to see the images directly and I suspect most of them wouldn't click through on individual images to see each photo, so I think it's worth the trouble.

And now that I think about it, it feels silly not to post those updates on this blog now. But the gallery is private right now, and I think I'd like to keep it that way: it gives me more freedom to share more intimate pictures with people.

Using dtach instead of screen for my IRC bouncer

I have been using irssi in a screen session for a long time now. Recently I started thinking about simplifying that setup by setting up password-less authentication to the session, but also running it as a separate user. This was especially important to keep possible compromises of the IRC client limited to a sandboxed account instead of my more powerful user.

To further limit the impact of a possible compromise, I also started using dtach instead of GNU screen to handle my irssi session: irssi can still run arbitrary code, but at least an attacker can't just open a new window in screen and needs to think a little more about how to do it.
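
For reference, the dtach invocation for this kind of setup looks something like this (a sketch; the socket path is an arbitrary choice of mine):

# attach to the irssi session, creating it first if the socket does not exist yet
dtach -A ~/.irssi/dtach.socket irssi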

Eventually, I could make a profile in systemd to keep it from forking at all, although I'm not sure irssi could still work in such an environment. The change broke the "auto-away script" which relies on screen's peculiar handling of the socket to signify if the session is attached, so I filed that as a feature request.


New large hard drive and 8-year old server anniversary

Mon, 02/25/2019 - 12:59

February 22nd was the "installation birthday" of my home server:

/etc/cron.daily/installation-birthday:

[ASCII-art birthday cake]

Congratulations, your Debian system "marcos"
was installed 8 year(s) ago today!

Best wishes,

Your local system administrator

I can't believe this machine I built 8 years ago has been running continuously all that time. That is far, far beyond the usual 3 or 5 year depreciation period set in most organizations. It goes to show how some hardware can be reliable in the long term.

I bought yet another new drive to deal with my ever-increasing disk use. I got a Seagate IronWolf 8TB ST8000VN0022 at Canada Computers (CC) for 290$CAD. I also bought a new enclosure, a transparent Orico one which is kind of neat. I had previously bought this thing instead; it was really hard to fit the hard drive in because the bottom was mis-aligned: you had to lift the drive slightly to fit it into the SATA connector. Even the salesman at CC couldn't figure it out. The new enclosure is a bit better, but also doesn't quite close correctly when a hard drive is present.

Compatibility and reliability

The first 8TB drive I got last week was DOA (no, not that DOA): it was "clicking" and wasn't detected by the kernel. CC took it back without questions, after they were able to plug it into something. I'm not sure that's a good sign for the reliability of that drive, but I have another running in a backup server and it has worked well so far.

I was happily surprised to see the new drive works with my old Asus P5G410-M motherboard. My previous attempt at connecting this huge drive to older equipment failed in a strange way: when connected to a Thermaltake USB-SATA dock, it would only be recognized as 4TB. I don't remember if I tried to connect it inside the server, but I do remember connecting it to curie instead, which was kind of a mess. So I'm quite happy to see the drive works even on an old SATA controller, a testament to the backwards-compatibility requirements of the standard.

Setup

Of course, I used a GUID Partition Table (GPT) because MBR (Master Boot Record) partition tables are limited to 2TiB. I have learned about parted --align optimal to silence the warnings when creating the partitions:

parted /dev/sdc mklabel gpt
parted -a optimal /dev/sdc mkpart primary 0% 8MB
parted -a optimal /dev/sdc mkpart primary 8MB 100%

I have come to like calling parted directly without going into its shell. It's clean and easy to copy-paste. It also makes me wonder why the Debian installer bothers with that complicated partition editor after all...

I have encrypted the drive using Debian stretch's LUKS defaults, but I have given special attention to the filesystem settings, given the drive is so big. Here's the commandline I ended up using:

mkfs -t ext4 -j -T largefile -i 65536 -m 1 /dev/mapper/8tb_crypt

Here are the details of each bit:

  • ext4 - I still don't trust BTRFS enough, and I don't need the extra features

  • -j - journaling, probably default, but just in case

  • -T largefile - this is where things get interesting: the mkfs manpage says that -b -1 is supposed to tweak the block size according to the filesystem size, but mkfs refuses to parse this, so I had to use the -T setting. But it turns out that didn't change the block size anyway, which is still at the eternal 4KiB

  • -i 65536 ("64 KiB per inode" ratio) - the default mkfs setting would have allowed for around five hundred million (488 281 250) inodes on this disk. Given that I have less than a million files to store on there so far, that seemed totally overkill, so I bumped it up.

  • -m 1 - don't reserve as much space for root, as the default (5%) would have reserved 400GB. 1% is still too big (80GB), but I can reclaim the space later with tune2fs -m 0.001 /dev/mapper/8tb_crypt. It gives me a good "heads up" before it's time to change the drive again. Besides, it's strangely not possible to pass lower, non-zero values to mkfs. (The resulting settings can be double-checked after the fact, see the command right after this list.)
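
As a sanity check (my addition, not part of the original notes), the resulting values can be confirmed on the finished filesystem:

# show the block size, inode count and reserved blocks actually in effect
tune2fs -l /dev/mapper/8tb_crypt | grep -E 'Block size|Inode count|Reserved block count'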

Benchmarks

I performed a few benchmarks. It looks like the disk can easily saturate the SATA bus, which is limited to 150MB/s (1.5Gbit/s unencoded):

root@marcos:~# dd bs=1M count=512 conv=fdatasync if=/dev/zero of=/mnt/testfile
512+0 enregistrements lus
512+0 enregistrements écrits
536870912 bytes (537 MB, 512 MiB) copied, 3,4296 s, 157 MB/s
root@marcos:~# dd bs=1M count=512 if=/mnt/testfile of=/dev/null
512+0 enregistrements lus
512+0 enregistrements écrits
536870912 bytes (537 MB, 512 MiB) copied, 0,367484 s, 1,5 GB/s
root@marcos:~# hdparm -Tt /dev/sdc

/dev/sdc:
 Timing cached reads:   2514 MB in  2.00 seconds = 1257.62 MB/sec
 Timing buffered disk reads: 660 MB in 3.00 seconds = 219.98 MB/sec

A SMART test succeeded after 20 hours. Transferring the files over from the older disk took even longer: at 3.5TiB used, it's quite a lot of data and the older disk does not yield the same performance as the new one. rsync seems to show numbers between 40 and 50MB/s (or MiB/s?), which means the entire transfer takes more than a day to complete.
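
For reference, that kind of long SMART self-test can be started and checked with smartmontools (a sketch, assuming the disk is still at /dev/sdc):

smartctl -t long /dev/sdc        # launch the extended self-test, which runs in the background
smartctl -l selftest /dev/sdc    # review the result once the test completes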

I have considered setting up the new drive as a degraded RAID-1 array to facilitate those transfers, but it doesn't seem to be worth the trouble: this would yield warnings in a few places, add some overhead (including scrubbing, for example) and might make me freak out for nothing in the future. This is a single drive, and will probably stay that way for the foreseeable future.

The sync is therefore made with good old rsync:

rsync -aAvP /srv/ /mnt/

Some more elaborate tests performed with fio also show that random read/write performance is somewhat poor (<1MB/s):

root@marcos:/srv# fio --name=stressant --group_reporting --directory=test --size=100M --readwrite=randrw --direct=1 --numjobs=4
stressant: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 4 processes
stressant: Laying out IO file(s) (1 file(s) / 100MB)
stressant: Laying out IO file(s) (1 file(s) / 100MB)
stressant: Laying out IO file(s) (1 file(s) / 100MB)
stressant: Laying out IO file(s) (1 file(s) / 100MB)
Jobs: 2 (f=2): [_(2),m(2)] [99.4% done] [1097KB/1305KB/0KB /s] [274/326/0 iops] [eta 00m:02s]
stressant: (groupid=0, jobs=4): err= 0: pid=10161: Mon Feb 25 12:51:21 2019
  read : io=205352KB, bw=586756B/s, iops=143, runt=358378msec
    clat (usec): min=145, max=367185, avg=23237.22, stdev=24300.33
     lat (usec): min=145, max=367186, avg=23238.42, stdev=24300.31
    clat percentiles (usec):
     |  1.00th=[  450],  5.00th=[ 3792], 10.00th=[ 6816], 20.00th=[ 9408],
     | 30.00th=[12608], 40.00th=[14912], 50.00th=[17280], 60.00th=[19328],
     | 70.00th=[22656], 80.00th=[27264], 90.00th=[46848], 95.00th=[69120],
     | 99.00th=[123392], 99.50th=[148480], 99.90th=[238592], 99.95th=[272384],
     | 99.99th=[329728]
  write: io=204248KB, bw=583601B/s, iops=142, runt=358378msec
    clat (usec): min=164, max=322970, avg=4646.01, stdev=10840.13
     lat (usec): min=165, max=322971, avg=4647.36, stdev=10840.16
    clat percentiles (usec):
     |  1.00th=[  195],  5.00th=[  227], 10.00th=[  251], 20.00th=[  310],
     | 30.00th=[  378], 40.00th=[  494], 50.00th=[  596], 60.00th=[ 2832],
     | 70.00th=[ 6176], 80.00th=[ 8896], 90.00th=[12480], 95.00th=[15552],
     | 99.00th=[22400], 99.50th=[33024], 99.90th=[199680], 99.95th=[234496],
     | 99.99th=[272384]
    lat (usec) : 250=4.86%, 500=16.18%, 750=7.01%, 1000=1.45%
    lat (msec) : 2=0.91%, 4=3.69%, 10=19.06%, 20=27.09%, 50=15.04%
    lat (msec) : 100=3.51%, 250=1.14%, 500=0.05%
  cpu          : usr=0.11%, sys=0.27%, ctx=103127, majf=0, minf=31
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=51338/w=51062/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=205352KB, aggrb=573KB/s, minb=573KB/s, maxb=573KB/s, mint=358378msec, maxt=358378msec
  WRITE: io=204248KB, aggrb=569KB/s, minb=569KB/s, maxb=569KB/s, mint=358378msec, maxt=358378msec

Disk stats (read/write):
    dm-6: ios=51862/51241, merge=0/0, ticks=1203452/250196, in_queue=1453720, util=100.00%, aggrios=51736/51295, aggrmerge=168/61, aggrticks=1196604/246444, aggrin_queue=1442968, aggrutil=100.00%
  sdb: ios=51736/51295, merge=168/61, ticks=1196604/246444, in_queue=1442968, util=100.00%

I am still, overall, quite happy with those results.


January 2019 report: LTS, Mailman 3, Vero 4k, Kubernetes, Undertime, Monkeysign, oh my!

Wed, 02/06/2019 - 10:32

January is often a long month in our northern region. Very cold, lots of snow, which can mean a lot of fun as well. But it's also a great time to cocoon (or maybe hygge?) in front of the computer and do great things. I think the last few weeks were particularly fruitful, which led to this rather lengthy report, which I hope will nonetheless be interesting.

So grab some hot cocoa, a coffee, tea or whatever warm beverage (or a cool one if you're in the southern hemisphere) and hopefully you'll learn awesome things. I know I did.

Free software volunteer work

As always, the vast majority of my time was actually spent volunteering on various projects, while scrambling near the end of the month to work on paid stuff. For the first time here I mention my Kubernetes work, but I've also worked on the new Mailman 3 packages, my monkeysign and undertime packages (including new configuration file support for argparse), random Debian work, and Golang packaging. Oh, and I bought a new toy for my home cinema, which I warmly recommend.

Kubernetes research

While I've written multiple articles on Kubernetes for LWN in the past, I am somewhat embarrassed to say that I don't have much experience running Kubernetes itself for real out there. But for a few months, with a group of fellow sysadmins, we've been exploring various container solutions and gravitated naturally towards Kubernetes. In the last month, I particularly worked on deploying a Ceph cluster with Rook, a tool to deploy storage solutions on a Kubernetes cluster (submitting a patch while I was there). Like many things in Kubernetes, Rook is shipped as a Helm chart, more specifically as an "operator", which might be described (if I understand this right) as a container that talks with Kubernetes to orchestrate other containers.

We've similarly worked on containerizing Nextcloud, which proved to be pretty shitty at behaving like a "cloud" application: secrets, dynamic data and configuration are all mixed up in the config directory, which makes it really hard to manage sanely in a container environment. The only way we found it could work was to mount the configuration as a volume, which means configuration becomes data and can't be controlled through git. Which is bad. This is also how the proposed Nextcloud Helm chart solves this problem (on which I've provided a review), for what it's worth.

We've also worked on integrating GitLab in our workflow, so that we keep configuration as code and deploy on pushes. While GitLab talks a lot about Kubernetes integration, the actual integration features aren't that great: unless I totally misunderstood how it's supposed to work, it seems you need to provide your own container and run kubectl from it, using the tokens provided by GitLab. And if you want to do anything of significance, you will probably need to give GitLab cluster access to your Kubernetes cluster, which kind of freaks me out considering the number of security issues that keep coming out with GitLab recently.

In general, I must say I was very skeptical of Kubernetes when I first attended those conferences: too much hype, buzzwords and suits. I felt that Google just threw us a toy project to play with while they kept the real stuff to themselves. I don't think that analysis is wrong, but I do think Kubernetes has something to offer, especially for organizations still stuck in the "shared hosting" paradigm where you give users a shell account or (S?!)FTP access and run mod_php on top. Containers at least provide some level of isolation out of the box and make such multi-tenant offerings actually reasonable and much more scalable. With a little work, we've been able to set up a fully redundant and scalable storage cluster and Nextcloud service: doing this from scratch wouldn't be that hard either, but it would have been done only for Nextcloud. The trick is that the knowledge and experience we gained by doing this with Nextcloud will be useful for all the other apps we'll be hosting in the future. So I think there's definitely something there.

Debian work

I participated in the Montreal BSP, of which Louis-Philippe Véronneau made a good summary. I also sponsored a few uploads and fixed a few bugs. We didn't fix that many bugs, but I gave two workshops, including my now well-tuned packaging 101 workshop, which seems to be always quite welcome. I really wish I could make a video of that talk, because I think it's useful in going through the essentials of Debian packaging and could use a wider audience. In the meantime, my reference documentation is the best you can get.

I've decided to let bugs-everywhere die in Debian. There's a release critical bug and it seems no one is really using this anymore, at least I'm not. I would probably orphan the package once it gets removed from buster, but I'm not actually the maintainer, just an uploader... A promising alternative to BE seems to be git-bug, with support for synchronization with GitHub issues.

I've otherwise tried to get my figurative "house" of Debian packages in order for the upcoming freeze, which meant new updates for a number of my packages.

I've also sponsored the introduction of web-mode (RFS #921130), a nice package to edit HTML in Emacs, and filed the usual barrage of bug reports and patches.

Elegant argparse configfile support and new date parser for undertime

I've issued two new releases of my undertime project, which helps users coordinate meetings across timezones. I first started working on improving the date parser, which mostly involved finding a new library to handle dates. I started using dateparser, which behaves slightly better, and I ended up packaging it for Debian as well, although I still have to re-upload undertime to use the new dependency.

That was the 1.6.0 release, but it wasn't enough - my users wanted a configuration file! I ended up designing a simple, YAML-based configuration file parser that integrates quite well with argparse, after finding too many issues with existing solutions like Configargparse. I summarized those for the certbot project, which suffered from similar issues. I'm quite happy with my small, elegant solution for config file support. It is significantly better than the one I used for Monkeysign, which was (ab)using the fromfile option of argparse.

Mailman 3

Motivated by this post extolling the virtues of good old mailing lists to resist social media hegemony, I did a lot (too much) work on installing Mailman 3 on my own server. I have run Mailman 2 mailing lists for hundreds of clients in my previous job at Koumbit and I have so far used my access there to host a few mailing lists. This time, I wanted to try something new and figured Mailman 3 might be ready, 4 years after the 3.0 release and almost 10 years after the project started.

How wrong I was! Many things don't work: there is no French translation at all (nor any other translation, for that matter), no invite feature, template translation is buggy, the Debian backport fails with the MySQL version in stable... it's a mess. The complete history of my failure is better documented in mail.

I worked around many of those issues. I like the fact that I was almost able to replace the missing "invite" feature through the API, and Mailman 3 is much better to look at than the older version. They did fix a lot of things and I absolutely love the web interface, which allows users to interact with the mailing list as a forum. But maybe it will take a bit more time before it's ready for my use case.

Right now, I'm hesitant. One option is to go with a mailing list to connect with friends and family: it works for everyone because everyone uses email, if only for their password resets. The alternative is to use something like a (private?) Discourse instance, which could also double as a comments provider for my blog if I ever decide to switch away from Ikiwiki... Neither seems like a good solution, and both require extra work and maintenance, Discourse particularly so because it is very unlikely it will get shipped as a Debian package.

Vero: my new home cinema box

Speaking of Discourse, the reason I'm thinking about it is that I am involved in many online forums running it. It's generally a great experience, although I wish email integration was mandatory - it's great to be able to reply through your email client, and it's not always supported. One of the forums I participate in is the Pixls.us forum, where I posted a description of my photography kit, explained different NAS options I'm considering and explained part of my git-annex/darktable workflow.

Another forum I recently started working on is the OSMC.tv forum. I first asked what the full specifications were for their neat little embedded set-top box, the Vero 4k+. I wasn't fully satisfied with the answers (the hardware is not fully open), but I ended up ordering the device and moving the "home cinema services" off of the venerable marcos server, which is going to turn 8 years old this year. This was an elaborate enterprise which involved wiring power outlets (because a ground was faulty), vacuuming the basement (because it was filthy), doing extensive research on SSHFS setup and performance, dealing with systemd bugs, and so on.

In the end it was worth it: my roommates enjoy the new remote control. It's much more intuitive than the previous Bluetooth keyboard, it performs well enough, and is one less thing to overload poor marcos with.

Monkeysign alternatives testing

I already mentioned I was considering retiring Monkeysign, and recently a friend asked me to sign his key, so I figured it was a great time to test out possible replacements for the project. Turns out things were not as rosy as I thought.

I first tested pius and it didn't behave as well as I hoped. Generally, it asks too many cryptic questions the user shouldn't have to guess the answer to. Specifically, here are the issues I found in my review:

  1. it forces you to specify your signing key, which is error-prone and needlessly difficult for the user

  2. I don't quite understand what the first question means - there's too much to unpack there: is it for inline PGP/MIME? for sending email at all? for sending individual emails? what's going on? And then there's the second question:

  3. the second question should be optional: I already specified my key on the commandline, it should use that as a From...

  4. the signature level is useless and generally disregarded by all software, including OpenPGP implementations. Even if it were used, 0/1/2/3/s/n/h/q is a pretty horrible user interface.

And then it simply fails to send the email completely on dkg's key, but that might be because that key was so exotic...

Gnome-keysign didn't fare much better - I opened six different issues on the promising project:

  1. what does the internet button do?
  2. signing arbitrary keys in GUI
  3. error in french translation
  4. using mutt as a MUA does not work
  5. signing a key on the commandline never completes
  6. flatpak instructions failure

So, surprisingly, Monkeysign might survive a bit longer, as much as I have come to dislike the poor little thing...

Golang packaging

To help a friend get the new RiseupVPN package into Debian, I uploaded a bunch of Golang dependencies (bug #919936, bug #919938, bug #919941, bug #919944, bug #919945, bug #919946, bug #919947, bug #919948). This involved filing many bugs upstream, as many of those (often tiny) packages didn't have explicit licences, so several couldn't actually be uploaded yet, but the ITPs are there and hopefully someone will complete that thankless work.

I also tried to package two other useful Golang programs, dmarc-cat and gotop, both of which also required a significant number of dependencies to be packaged (bug #920387, bug #920388, bug #920389, bug #920390, bug #921285, bug #921286, bug #921287, bug #921288). dmarc-cat has just been accepted in Debian - it's very useful to decipher DMARC reports you get when you configure your DNS to receive such reports. This is part of a larger effort to modernize my DNS and mail configuration.

But gotop is just starting - none of the dependencies have been updated just yet, and I'm running out of steam a little, even though that looks like an awesome package.

Other work
  • I hosed my workstation / laptop backup by trying to be too clever with Borg. It bit back and left me holding the candle, the bastard.

  • Expanded on my disk testing documentation to include better examples of fio as part of my neglected stressant package

GitHub said I "opened 21 issues in 14 other repositories" which seems a tad insane. And there's of course probably more stuff I'm forgetting here.

Debian Long Term Support (LTS)

This is my monthly Debian LTS report.

sbuild regression

My first stop this month was to notice a problem with sbuild from buster running on jessie chroots (bug #920227). After discussions on IRC, where fellow Debian Developers basically fabricated me a patch on the fly, I sent merge request #5 which was promptly accepted and should be part of the next upload.

systemd

I again worked a bit on systemd. I marked CVE-2018-16866 as not affecting jessie, because the vulnerable code was introduced in later versions. I backported fixes for CVE-2018-16864 and CVE-2018-16865 and published the resulting package as DLA-1639-1, after doing some smoke-testing.

I still haven't gotten the courage to dig back into the large backport of tmpfiles.c required to fix CVE-2018-6954.

tiff review

I did a quick review of the fix for CVE-2018-19210 proposed upstream, which seems to have brought upstream's attention back to the issue and finally got the fix merged.

Enigmail EOL

After reflecting on the issue one last time, I decided to mark Enigmail as EOL in jessie, which involved an upload of debian-security-support to jessie (DLA-1657-1), unstable and a stable-pu.

DLA / website work

I worked again on fixing the LTS workflow with the DLAs on the main website. Reminder: hundreds of DLAs are missing from the website (bug #859122) and we need to figure out a way to automate the import of newer ones (bug #859123).

The details of my work are in this post but basically, I readded a bunch more DLAs to the MR and got some good feedback from the www team (in MR #47). There's still some work to be done on the DLA parser, although I have merged my own improvements (MR #46) as I felt they had been sitting for review long enough.

Next step is to deal with noise like PGP signatures correctly and thoroughly review the proposed changes.

While I was in the webmasters' backyard, I tried to help with a few things by merging an LTS errata entry and a PayPal integration note, although the latter ended up being a mistake that was reverted. I also rejected some issues (MR #13, MR #15) during a quick triage.

phpMyAdmin review

After reading this email from Lucas Kanashiro, I reviewed CVE-2018-19968 and reviewed and tested CVE-2018-19970.


Debian build helpers: dh dominates

Tue, 02/05/2019 - 19:54

It's been a while since someone did this. Back in 2009, Joey Hess gave a talk at DebConf 9 about debhelper and mentioned in his slides (PDF) that it was used in most Debian packages. Here was the ratio (page 10):

  • debhelper: 54%
  • cdbs: 25%
  • dh: 9%
  • other: 3%

Then Lucas Nussbaum made graphs from snapshot.debian.org that did the same, but with history. His latest post (archive link because the original is missing images), from 2015, confirmed Joey's 2009 results. It also showed cdbs slowly declining and a sharp uptake in dh usage (over plain debhelper). Here were the approximate numbers:

  • debhelper: 15%
  • cdbs: 15%
  • dh: 69%
  • other: 1%

I ran the numbers again. Jakub Wilk pointed me to the lintian.debian.org output that can be used to get the current state easily:

$ curl -so lintian.log.gz https://lintian.debian.org/lintian.log.gz
$ zgrep debian-build-system lintian.log.gz | awk '{print $NF}' | sort | uniq -c | sort -nr
  25772 dh
   2268 debhelper
   2124 cdbs-with-debhelper.mk
    257 dhmk
    123 other
      8 cdbs-without-debhelper.mk

Shoving this in a LibreOffice spreadsheet (sorry, my R/Python brain is slow today) gave me this nice little graph:

As of today, the numbers are now:

  • debhelper: 7%
  • cdbs: 7%
  • dh: 84%
  • other: 1%

(No, the numbers don't add up. Yes, it's a rounding error. Blame LibreOffice.)
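
For those allergic to spreadsheets, the same percentages can be computed straight from the lintian log (a sketch of mine building on the command above; these are per build-system tag, so the two cdbs variants still need to be lumped together by hand):

# turn the per-tag counts into percentages of the total
zgrep debian-build-system lintian.log.gz | awk '{print $NF}' | sort | uniq -c \
    | awk '{count[$2] = $1; total += $1}
           END {for (tag in count) printf "%-28s %5.1f%%\n", tag, 100 * count[tag] / total}'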

So while cdbs lost 10% of the packages in 6 years, it lost another half of its share in the last 4. It's also interesting to note that debhelper and cdbs are both shrinking at a similar rate.

This confirms that debhelper development is where everything is happening right now. The new dh(1) sequencer is also a huge improvement that almost everyone has adopted wholeheartedly.

Now of course, that remaining 15% of debhelper/cdbs (or just 7% of cdbs, depending on how pedantic you are) will be the hard part to transition. Notice how the 1% of "other" packages hasn't really moved in the last four years: that's because some packages in Debian are old, abandoned, ignored, complicated, or all of the above. So it will be difficult to convert the remaining packages and finalize this great unification Joey (unknowingly) started ten years ago, as the remaining packages are probably the hard, messy, old ones no one wants to fix because, well, "they're not broken so don't fix it".

Still, it's nice to see us agree on something for a change. I'd be quite curious to see an update of Lucas' historical graphs. It would be particularly useful to see the impact of the replacement of the old Alioth server with salsa.debian.org, because the latter runs GitLab and only supports Git. Without an easy-to-use internal hosting service, I doubt SVN, Darcs, Bzr and whatever is left in "other" there will survive very long.


December 2018 report: archiving Brazil, calendar and LTS

Fri, 12/21/2018 - 14:40
Last two months free software work

Keen readers probably noticed that I didn't produce a report in November. I am not sure why, but I couldn't find the time to do so. When looking back at those past two months, I didn't find that many individual projects I worked on, but there were massive ones, of the scale of archiving the entire government of Brazil or learning the intricacies of print media, both of which were slightly or largely beyond my existing skill set.

Calendar project

I've been meaning to write about this project more publicly for a while, but never found the way to do so productively. But now that the project is almost over -- I'm getting the final prints today and mailing others hopefully soon -- I think this deserves at least a few words.

As some of you might know, I bought a new camera last January. Wanting to get familiar with how it works and refresh my photography skills, I decided to embark on the project of creating a photo calendar for 2019. The basic idea was simple: take pictures regularly, then each month pick the best picture of that month, collect all those twelve pictures and send that to the corner store to print a few calendars.

Simple, right?

Well, twelve pictures turned into a whopping 8000 pictures since January, not all of which were that good of course. And of course, a calendar has twelve months -- so twelve pictures -- but also a cover and a back, which means thirteen pictures and some explaining. Being critical of my own work, it turned out that finding those pictures was sometimes difficult, especially considering the medium imposed some rules I didn't think about.

For example, the US Letter paper size imposes a different ratio (1.29) than the photographic ratio (~1.5), which means I had to reframe every photograph. Sometimes this meant discarding entire ideas. Other photos were discarded because they were too depressing, even if I found them artistically or journalistically important: you don't want to be staring at a poor kid distressed at going into school every morning for an entire month. Another piece of advice I got was to forget about sunsets and dark pictures, as they are difficult to render correctly in print. We're used to bright screens displaying those pictures; paper is a completely different feeling. Since I have a good vibe for night and star photography, this was a fairly dramatic setback, even though I still did feature two excellent pictures.

Then I got a little carried away. At the suggestion of a friend, I figured I could get rid of the traditional holiday dates and replace them with truly secular holidays, which got me involved in a deep search for layout tools, which in turn naturally brought me to this LaTeX template. Those who have worked with LaTeX (or probably any professional layout tool) know what's next: I spent a significant amount of time perfecting the rendering and crafting the final document.

Slightly upset by the prices proposed by the corner store (15$CAD/calendar!), I figured I could do better by printing on my own, especially encouraged by a friend who had access to a good color laser printer. I then spent multiple days (if not weeks) looking for the right paper, which got me into the rabbit hole of paper weights, brightness, texture, and more. I'll just say this: if you ever thought lengths were ridiculous in the imperial system, wait until you find out how paper weights work. I finally managed to find some 270gsm gloss paper at the corner store -- after looking all over town, it was right there -- and did a first print of 15 calendars, which turned into 14 because of trouble with jammed paper. Because the printer couldn't do double-sided copies, I had to spend basically 4 hours tending to that stupid device, bringing my loathing of printers (the machines) and my respect for printers (the people) to an entirely new level.

The time spent on the print was clearly not worth it in the end, and I ended up scheduling another print run with a professional printer. The first proofs are clearly superior to the ones I have done myself and, in retrospect, completely worth the 15$ per copy.

I still haven't paid for my time in any significant way on that project, something I seem to excel at doing consistently. The prints themselves are not paid for, but my time in producing those photographs is not paid either, which clearly outlines that my future as a professional photographer, if any, lies far away from producing those silly calendars, at least for now.

More documentation on the project is available, in French, in calendrier-2019. I am also hoping to eventually publish a graphical review of the calendar, but for now I'll leave that for the friends and family who will receive the calendar as a gift...

Archival of Brasil

Another modest project I embarked on was a mission to archive the government of Brazil following the election of the infamous Jair Bolsonaro, dictatorship supporter, homophobe, racist, nationalist and Christian freak who somehow managed to get elected president of Brazil. Since he threatened to rip apart basically the entire fabric of Brazilian society, comrades were worried that he might attack and destroy precious data from government archives when he came into power in January 2019. Like many countries in Latin America that lived under dictatorships in the 20th century, Brazil made an effort to investigate and preserve the memory of the atrocities committed during those troubled times.

Since I had written about archiving websites, those comrades naturally thought I could be of use, so we embarked on a crazy quest to, basically, archive Brazil. We tried to create a movement similar to the Internet Archive (IA) response to the 2016 Trump election, but were not really successful at getting IA involved. I was, fortunately, able to get the good folks at Archive Team (AT) involved, and we have successfully archived a significant number of websites, adding terabytes of data to the IA through the backdoor that is AT. We also ran a bunch of archival jobs on a dedicated server, leveraging tools like youtube-dl, git-annex, wpull and, eventually, grab-site to archive websites, social networks and video feeds.
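To give a rough idea of the kind of jobs we ran on that server, here is a sketch of typical invocations; the target URLs and file names are hypothetical and the exact flags varied from site to site:

# crawl a website recursively into a WARC file with wpull
wpull --recursive --warc-file example-gov-br http://example.gov.br/

# or let grab-site handle the crawl policy and WARC writing
grab-site http://example.gov.br/

# mirror a video channel with youtube-dl, keeping track of what was already fetched
youtube-dl --write-info-json --download-archive archive.txt 'https://www.youtube.com/user/example'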

I kind of burned out on the job. Following Brazilian politics was scary and traumatizing -- I have been very close to Brazilian folks and they are colorful, friendly people. The idea that such a horrible person could come into power there is absolutely terrifying, and I kept thinking how disgusted I would be if I had to archive material from the government of Canada, which I do not particularly like either... This goes against a lot of my personal ethics, but it still beats the obscurity of pure destruction of important scientific, cultural and historical data.

Miscellaneous

Considering the workload involved in the above craziness, the fact that I worked on fewer projects than my usual madness shouldn't come as a surprise.

  • As part of the calendar work, I wrote a new tool called moonphases which shows a list of moon phase events in the given time period, and shipped that as part of undertime 1.5 for lack of a better place.

  • AlternC revival: friends at Koumbit asked me for the source code of the AlternC projects I had been working on. I was disappointed (but not surprised) that upstream simply took those repositories down without publishing an archive. Thankfully, I still had SVN checkouts, but unfortunately those do not have the full history, so I reconstructed repositories based on the last checkout I had for alternc-mergelog, alternc-stats, and alternc-slavedns.

  • I packaged two new projects in Debian: bitlbee-mastodon (to connect to the new Mastodon network over IRC) and python-internetarchive (a command-line interface to the IA upload forms).

  • my work on archival tools led to a moderately important patch in pywb: allow symlinking and hardlinking files instead of just copying, which was important to manage multiple large WARC files along with git-annex.

  • I also noticed the IA people were using a tool called slurm to diagnose bandwidth problems on their networks and implemented iface speed detection on Linux while I was there. slurm is interesting, but I also found out about bmon through the hilarious hollywood project. Each has its advantages: bmon has packets-per-second graphs, while slurm only has bandwidth graphs but also records maximal burst speeds, which is very useful. A quick usage sketch follows this list.
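For the curious, here is roughly how those two monitors get invoked; the interface name is just an example, and the exact flags may vary between versions:

slurm -i eth0   # bandwidth graphs for a single interface
bmon -p eth0    # per-interface rate graphs, including packets per second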

Debian Long Term Support (LTS)

This is my monthly Debian LTS report. Note that my previous report wasn't published on this blog but on the mailing list.

Enigmail / GnuPG 2.1 backport

I've spent a significant amount of time working on the Enigmail backport for a third consecutive month. I first published a straightforward backport of GnuPG 2.1 depending on the libraries available in jessie-backports last month, but then I actually rebuilt the dependencies as well and sent a "HEADS UP" to the mailing list, which finally got people's attention.

There are many changes bundled in that possible update: GnuPG actually depends on about half a dozen other libraries, mostly specific to GnuPG, but in some cases used by third-party software as well. The most problematic one is libgcrypt20, which Emilio Pozuelo Monfort said included tens of thousands of lines of changes. So even though I tested the update against cryptsetup, gpgme, libotr, mutt and Enigmail itself, there are concerns that other dependencies merit more testing as well.

This caused many to raise the idea of aborting the work and simply marking Enigmail as unsupported in jessie. But Daniel Kahn Gillmor suggested this should also imply removing Thunderbird itself from jessie, as simply removing Enigmail would force people to use the binaries from Mozilla's add-ons service. Gillmor explained that those builds include an OpenPGP.js implementation of dubious origin, which is especially problematic considering it deals with sensitive private key material.

It's unclear which way this will go next. I'm taking a break from this issue and hope others will be able to test the packages. If we keep working on Enigmail, the next steps will be to re-enable the dbg packages that were removed in the stretch updates, use dh-autoreconf correctly, remove some mingw packages I forgot, and test gcrypt like crazy (especially the 1.7 update). We'd also update to the latest Enigmail, as it fixes issues that forced the Tails project to disable Autocrypt because of weird interactions that make it send cleartext (instead of encrypted) mail in some cases.

Automatic unclaimer

My previous report yielded an interesting discussion around my work on the security tracker, specifically the "automatic unclaimer" designed to unassign issues that have been idle for too long. Holger Levsen, with his new coordinator hat, tested the program and found many bugs and missing features, which I was happy to fix and implement. After many patches and much back and forth, the program seems to be working well, although it is still run by hand by the coordinator.

DLA website publication

I took a look at various issues surrounding the publication of LTS advisories on the main debian.org website. While normal security advisories are regularly published on debian.org/security, more than 500 DLAs are missing from the website, mainly because DLAs are not automatically imported.

As it turns out, there is a script called parse-dla.pl that is designed to handle those entries but for some reason, they are not imported anymore. So I got to work to import the backlog and make sure new entries are properly imported.

Various fixes to parse-dla.pl were necessary to correctly parse messages both from the templates generated by gen-DLA and from the existing archives. Then I tested the result with two existing advisories, which resulted in two merge requests (MRs) on the webwml repository: add data for DLA-1561 and add dla-1580 advisory. I requested and was granted access to the repository, and eventually merged my own MRs after a review from Levsen.

I eventually used the following procedure to test importing the entire archive:

rsync -vPa master.debian.org:/home/debian/lists/debian-lts-announce .
cd debian-lts-announce
xz -d *.xz
cat * > ../giant.mbox
mbox2maildir ../giant.mbox debian-lts-announce.d
for mail in debian-lts-announce.d/cur/*; do
    ~/src/security-tracker/./parse-dla.pl $mail
done

This led to 82 errors on an empty directory, which is not bad at all considering the amount of data processed. Of course, there were many more errors in the live directory, as many advisories were already present. In the live directory, this process resulted in 2431 new advisories being added to the website.

There were a few corner cases:

  • The first month or so didn't use DLA identifiers and many of those were not correctly imported even back then.

  • DLA-574-1 was a duplicate, covered by the DLA-574-2 regression update. But I only found the ImageMagick advisory -- it looks like the qemu one was never published.

  • Similarly, the graphite2 regression was never assigned a real identifier.

  • Other cases include, for example, DLA-787-1, which was sent twice, and the DLA-1263-1 duplicate, which was irrecoverable as it was never added to data/DLA/list.

Those special cases will all need to be handled by an eventual automation of this process, which I still haven't quite figured out. Maybe a process similar to the unclaimer will be followed: the coordinator or I could add missing DLAs until we streamline the process. It seems unlikely we will want to add more friction to the DLA release by forcing workers to send merge requests to the web team, as that would only put more pressure on them...

There are also nine advisories missing from the mailing list archive because of a problem with the mailing list server at that time. We'll need to extract those from people's email archives, which I am not sure how to coordinate at this point.

PHP CVE identifier confusion

I have investigated CVE-2018-19518, mistakenly identified as CVE-2018-19158 in various places, including upstream's bugtracker. I requested the latter erroneous CVE-2018-19158 to be retired to avoid any future confusion. Unfortunately, Mitre indicated the CVE was already in "active use for pre-disclosure vulnerability coordination", which made it impossible to correct the error at that level.

I've instead asked upstream to correct the metadata in their tracker but it seems nothing has changed there yet.

Categories: External Blogs

Large files with Git: LFS and git-annex

Mon, 12/10/2018 - 19:00

Git does not handle large files very well. While there is work underway to handle large repositories through the commit graph work, Git's internal design has remained surprisingly constant throughout its history, which means that storing large files into Git comes with a significant and, ultimately, prohibitive performance cost. Thankfully, other projects are helping Git address this challenge. This article compares how Git LFS and git-annex address this problem and should help readers pick the right solution for their needs.

The problem with large files

As readers probably know, Linus Torvalds wrote Git to manage the history of the kernel source code, which is a large collection of small files. Every file is a "blob" in Git's object store, addressed by its cryptographic hash. A new version of that file will store a new blob in Git's history, with no deduplication between the two versions. The pack file format can store binary deltas between similar objects, but if many objects of similar size change in a repository, that algorithm might fail to properly deduplicate. In practice, large binary files (say JPEG images) have an irritating tendency of changing completely when even the smallest change is made, which makes delta compression useless.

There have been different attempts at fixing this in the past. In 2006, Torvalds worked on improving the pack-file format to reduce object duplication between the index and the pack files. Those changes were eventually reverted because, as Nicolas Pitre put it: "that extra loose object format doesn't appear to be worth it anymore".

Then in 2009, Caca Labs worked on improving the fast-import and pack-objects Git commands to do special handling for big files, in an effort called git-bigfiles. Some of those changes eventually made it into Git: for example, since 1.7.6, Git will stream large files directly to a pack file instead of holding them all in memory. But files are still kept forever in the history.

An example of trouble I had to deal with is for the Debian security tracker, which follows all security issues in the entire Debian history in a single file. That file is around 360,000 lines for a whopping 18MB. The resulting repository takes 1.6GB of disk space and a local clone takes 21 minutes to perform, mostly taken up by Git resolving deltas. Commit, push, and pull are noticeably slower than in a regular repository, taking anywhere from a few seconds to a minute depending on how old the local copy is. And running annotate on that large file can take up to ten minutes. So even though that is a simple text file, it's grown large enough to cause significant problems for Git, which is otherwise known for stellar performance.

Intuitively, the problem is that Git needs to copy files into its object store to track them. Third-party projects therefore typically solve the large-files problem by taking files out of Git. In 2009, Git evangelist Scott Chacon released GitMedia, which is a Git filter that simply takes large files out of Git. Unfortunately, there hasn't been an official release since then and it's unclear if the project is still maintained. The next effort to come up was git-fat, first released in 2012 and still maintained. But neither tool has seen massive adoption yet. If I were to venture a guess, it might be because both require manual configuration. Both also require a custom server (rsync for git-fat; S3, SCP, Atmos, or WebDAV for GitMedia) which limits collaboration since users need access to another service.

Git LFS

That was before GitHub released Git Large File Storage (LFS) in August 2015. Like all software taking files out of Git, LFS tracks file hashes instead of file contents. So instead of adding large files into Git directly, LFS adds a pointer file to the Git repository, which looks like this:

version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345

LFS then uses Git's smudge and clean filters to show the real file on checkout. Git only stores that small text file and does so efficiently. The downside, of course, is that large files are not version controlled: only the latest version of a file is kept in the repository.

Git LFS can be used in any repository by installing the right hooks with git lfs install then asking LFS to track any given file with git lfs track. This will add the file to the .gitattributes file which will make Git run the proper LFS filters. It's also possible to add patterns to the .gitattributes file, of course. For example, this will make sure Git LFS will track MP3 and ZIP files:

$ cat .gitattributes
*.mp3 filter=lfs -text
*.zip filter=lfs -text
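For reference, the setup itself boils down to a couple of commands; the patterns are just examples:

git lfs install                  # install the LFS hooks in this repository
git lfs track "*.mp3" "*.zip"    # add the patterns to .gitattributes
git add .gitattributes           # commit the tracking configuration with the repo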

After this configuration, we use Git normally: git add, git commit, and so on will talk to Git LFS transparently.

The actual files tracked by LFS are copied to a path like .git/lfs/objects/{OID-PATH}, where {OID-PATH} is a sharded file path of the form OID[0:2]/OID[2:4]/OID and where OID is the content's hash (currently SHA-256) of the file. This brings the extra feature that multiple copies of the same file in the same repository are automatically deduplicated, although in practice this rarely occurs.
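As a quick illustration, using the OID from the pointer file shown above, the local storage path can be derived like this (a sketch; the sharding is an internal detail of the LFS client):

oid=4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
ls .git/lfs/objects/${oid:0:2}/${oid:2:2}/$oid
# .git/lfs/objects/4d/7a/4d7a2146...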

Git LFS will copy large files to that internal storage on git add. When a file is modified in the repository, Git notices, the new version is copied to the internal storage, and the pointer file is updated. The old version is left dangling until the repository is pruned.

This process only works for new files you are importing into Git, however. If a Git repository already has large files in its history, LFS can fortunately "fix" repositories by retroactively rewriting history with git lfs migrate. This has all the normal downsides of rewriting history, however --- existing clones will have to be reset to benefit from the cleanup.
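For example, retroactively converting every ZIP file across all branches might look something like this; note that the rewrite only happens locally and the result must be force-pushed:

git lfs migrate import --include="*.zip" --everything
git push --force-with-lease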

LFS also supports file locking, which allows users to claim a lock on a file, making it read-only everywhere except in the locking repository. This allows users to signal others that they are working on an LFS file. Those locks are purely advisory, however, as users can remove other users' locks by using the --force flag. LFS can also prune old or unreferenced files.
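The locking workflow looks roughly like this; the file name is only an example:

git lfs lock images/banner.psd            # claim the lock
git lfs locks                             # list currently held locks
git lfs unlock --force images/banner.psd  # override someone else's lock
git lfs prune                             # delete old, unreferenced local copies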

The main limitation of LFS is that it's bound to a single upstream: large files are usually stored in the same location as the central Git repository. If it is hosted on GitHub, this means a default quota of 1GB storage and bandwidth, but you can purchase additional "packs" to expand both of those quotas. GitHub also limits the size of individual files to 2GB. This upset some users surprised by the bandwidth fees, which were previously hidden in GitHub's cost structure.

While the actual server-side implementation used by GitHub is closed source, there is a test server provided as an example implementation. Other Git hosting platforms have also implemented support for the LFS API, including GitLab, Gitea, and BitBucket; that level of adoption is something that git-fat and GitMedia never achieved. LFS does support hosting large files on a server other than the central one --- a project could run its own LFS server, for example --- but this will involve a different set of credentials, bringing back the difficult user onboarding that affected git-fat and GitMedia.

Another limitation is that LFS only supports pushing and pulling files over HTTP(S) --- no SSH transfers. LFS uses some tricks to bypass HTTP basic authentication, fortunately. This also might change in the future as there are proposals to add SSH support, resumable uploads through the tus.io protocol, and other custom transfer protocols.

Finally, LFS can be slow. Every file added to LFS takes up double the space on the local filesystem as it is copied to the .git/lfs/objects storage. The smudge/clean interface is also slow: it works as a pipe, but buffers the file contents in memory each time, which can be prohibitive with files larger than available memory.

git-annex

The other main player in large file support for Git is git-annex. We covered the project back in 2010, shortly after its first release, but it's certainly worth discussing what has changed in the eight years since Joey Hess launched the project.

Like Git LFS, git-annex takes large files out of Git's history. The way it handles this is by storing the file contents in .git/annex and replacing the file in the work tree with a symbolic link pointing to that content. We should probably credit Hess for this innovation, since the Git LFS storage layout is obviously inspired by git-annex. The original design of git-annex introduced all sorts of problems, however, especially on filesystems lacking symbolic-link support. So Hess has implemented different solutions to this problem. Originally, when git-annex detected such a "crippled" filesystem, it switched to direct mode, which kept files directly in the work tree, while internally committing the symbolic links into the Git repository. This design turned out to be a little confusing to users, including myself; I have managed to shoot myself in the foot more than once using this system.

Since then, git-annex has adopted a different v7 mode that is also based on smudge/clean filters, which it called "unlocked files". Like Git LFS, unlocked files will double disk space usage by default. However it is possible to reduce disk space usage by using "thin mode" which uses hard links between the internal git-annex disk storage and the work tree. The downside is, of course, that changes are immediately performed on files, which means previous file versions are automatically discarded. This can lead to data loss if users are not careful.
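A minimal sketch of switching an existing repository to that mode, assuming a reasonably recent git-annex:

git annex upgrade            # move the repository to the v7 layout
git config annex.thin true   # hard-link content into the work tree instead of copying
git annex adjust --unlock    # check out all annexed files in the unlocked state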

Furthermore, git-annex in v7 mode suffers from some of the performance problems affecting Git LFS, because both use the smudge/clean filters. Hess actually has ideas on how the smudge/clean interface could be improved. He proposes changing Git so that it stops buffering entire files into memory, allows filters to access the work tree directly, and adds the hooks he found missing (for stash, reset, and cherry-pick). Git-annex already implements some tricks to work around those problems itself but it would be better for those to be implemented in Git natively.

Being more distributed by design, git-annex does not have the same "locking" semantics as LFS. Locking a file in git-annex means protecting it from changes, so files need to actually be in the "unlocked" state to be editable, which might be counter-intuitive to new users. In general, git-annex has some of those unusual quirks and interfaces that often come with more powerful software.
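In practice, editing an annexed file therefore looks something like the following sketch (the file name is an example):

git annex unlock video.mp4   # make the file writable in the work tree
# ... edit the file ...
git annex add video.mp4      # store the new version in the annex
git commit -m "update video"
git annex lock video.mp4     # return the file to its protected state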

And git-annex is much more powerful: it not only addresses the "large-files problem" but goes much further. For example, it supports "partial checkouts" --- downloading only some of the large files. I find that especially useful to manage my video, music, and photo collections, as those are too large to fit on my mobile devices. Git-annex also has support for location tracking, where it knows how many copies of a file exist and where, which is useful for archival purposes. And while Git LFS is only starting to look at transfer protocols other than HTTP, git-annex already supports a large number through a special remote protocol that is fairly easy to implement.
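Partial checkouts and location tracking are everyday commands in git-annex; the paths below are examples:

git annex get photos/2018/                  # fetch only that directory's content
git annex drop photos/2017/                 # free local space, keeping the metadata
git annex whereis photos/2018/img_0001.jpg  # list which repositories hold a copy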

"Large files" is therefore only scratching the surface of what git-annex can do: I have used it to build an archival system for remote native communities in northern Québec, while others have built a similar system in Brazil. It's also used by the scientific community in projects like GIN and DataLad, which manage terabytes of data. Another example is the Japanese American Legacy Project which manages "upwards of 100 terabytes of collections, transporting them from small cultural heritage sites on USB drives".

Unfortunately, git-annex is not well supported by hosting providers. GitLab used to support it, but since it implemented Git LFS, it dropped support for git-annex, saying it was a "burden to support". Fortunately, thanks to git-annex's flexibility, it may eventually be possible to treat LFS servers as just another remote which would make git-annex capable of storing files on those servers again.

Conclusion

Git LFS and git-annex are both mature and well maintained programs that deal efficiently with large files in Git. LFS is easier to use and is well supported by major Git hosting providers, but it's less flexible than git-annex.

Git-annex, in comparison, allows you to store your content anywhere and espouses Git's distributed nature more faithfully. It also uses all sorts of tricks to save disk space and improve performance, so it should generally be faster than Git LFS. Learning git-annex, however, feels like learning Git: you always feel you are not quite there and you can always learn more. It's a double-edged sword and can feel empowering for some users and terrifyingly hard for others. Where you stand on the "power-user" scale, along with project-specific requirements, will ultimately determine which solution is the right one for you.

Ironically, after thorough evaluation of large-file solutions for the Debian security tracker, I ended up proposing to rewrite history and split the file by year which improved all performance markers by at least an order of magnitude. As it turns out, keeping history is critical for the security team so any solution that moves large files outside of the Git repository is not acceptable to them. Therefore, before adding large files into Git, you might want to think about organizing your content correctly first. But if large files are unavoidable, the Git LFS and git-annex projects allow users to keep using most of their current workflow.

This article first appeared in the Linux Weekly News.

Categories: External Blogs