
Anarcat

Why make things simple when you can make them complicated.

December 2018 report: archiving Brazil, calendar and LTS

Fri, 12/21/2018 - 14:40
Last two months free software work

Keen readers probably noticed that I didn't produce a report in November. I am not sure why, but I couldn't find the time to do so. Looking back at those past two months, I didn't find that many individual projects to report on, but there were massive ones, on the scale of archiving the entire government of Brazil or learning the intricacies of print media, both of which were partly or largely beyond my existing skill set.

Calendar project

I've been meaning to write about this project more publicly for a while, but never found the way to do so productively. But now that the project is almost over -- I'm getting the final prints today and mailing others hopefully soon -- I think this deserves at least a few words.

As some of you might know, I bought a new camera last January. Wanting to get familiar with how it works and refresh my photography skills, I decided to embark on the project of creating a photo calendar for 2019. The basic idea was simple: take pictures regularly, then each month pick the best picture of that month, collect all those twelve pictures and send that to the corner store to print a few calendars.

Simple, right?

Well, twelve pictures turned into a whopping 8000 pictures since January, not all of which were that good of course. And of course, a calendar has twelve months -- so twelve pictures -- but also a cover and a back, which means thirteen pictures and some explaining. Being critical of my own work, I sometimes found it difficult to pick those pictures, especially since the medium imposed some rules I hadn't thought about.

For example, the US Letter paper size imposes a different aspect ratio (1.29) than the usual photographic ratio (~1.5), which meant I had to reframe every photograph. Sometimes this meant discarding entire ideas. Other photos were discarded because they were too depressing, even if I found them artistically or journalistically important: you don't want to be staring at a poor kid distressed about going to school every morning for an entire month. Another piece of advice I got was to forget about sunsets and dark pictures, as they are difficult to render correctly in print. We're used to bright screens displaying those pictures; paper is a completely different feeling. Since I have a good feel for night and star photography, this was a fairly dramatic setback, even though I still did feature two excellent pictures.

Then I got a little carried away. At the suggestion of a friend, I figured I could get rid of the traditional holiday dates and replace them with truly secular holidays, which got me involved in a deep search for layout tools, which in turn naturally brought me to this LaTeX template. Those who have worked with LaTeX (or probably any professional layout tool) know what's next: I spent a significant amount of time perfecting the rendering and crafting the final document.

Slightly upset by the prices quoted by the corner store ($15 CAD per calendar!), I figured I could do better by printing on my own, especially encouraged by a friend who had access to a good color laser printer. I then spent multiple days (if not weeks) looking for the right paper, which got me into the rabbit hole of paper weights, brightness, texture, and more. I'll just say this: if you ever thought lengths were ridiculous in the imperial system, wait until you find out how paper weights work. I finally managed to find some 270gsm gloss paper at the corner store -- after looking all over town, it was right there -- and did a first print of 15 calendars, which turned into 14 because of jammed paper. Because the printer couldn't do double-sided (recto-verso) copies, I had to spend basically 4 hours tending to that stupid device, bringing my loathing of printers (the machines) and my respect for printers (the people) to an entirely new level.

The time spent on the print was clearly not worth it in the end, and I ended up scheduling another print run with a professional printer. The first proofs are clearly superior to the ones I made myself and, in retrospect, completely worth the $15 per copy.

I still haven't paid for my time in any significant way on that project, something I seem to excel at doing consistently. The prints themselves are not paid for, and neither is the time I spent producing those photographs, which makes it clear that my future as a professional photographer, if any, lies far away from producing those silly calendars, at least for now.

More documentation on the project is available, in French, in calendrier-2019. I am also hoping to eventually publish a graphical review of the calendar, but for now I'll leave that for the friends and family who will receive the calendar as a gift...

Archival of Brasil

Another modest project I embarked on was a mission to archive the government of Brazil following the election of the infamous Jair Bolsonaro, a dictatorship supporter, homophobe, racist, nationalist and Christian freak who somehow managed to get elected president of Brazil. Since he threatened to rip apart basically the entire fabric of Brazilian society, comrades were worried that he might attack and destroy precious data in government archives when he comes into power in January 2019. Like many countries in Latin America that lived under dictatorships in the 20th century, Brazil made an effort to investigate and keep the memory of the atrocities committed during those troubled times.

Since I had written about archiving websites, those comrades naturally thought I could be of use, so we embarked on a crazy quest to archive, basically, Brazil. We tried to create a movement similar to the Internet Archive (IA) response to the 2016 Trump election but were not really successful at getting IA involved. I was, fortunately, able to get the good folks at Archive Team (AT) involved and we have successfully archived a significant number of websites, adding terabytes of data to the IA through the backdoor that is AT. We also ran a bunch of archival jobs on a special server, leveraging tools like youtube-dl, git-annex, wpull and, eventually, grab-site to archive websites, social networks and video feeds.

I kind of burned out on the job. Following Brazilian politics was scary and traumatizing - I have been very close to Brazilian folks and they are colorful, friendly people. The idea that such a horrible person could come into power there is absolutely terrifying, and I kept thinking how disgusted I would be if I had to archive material from the government of Canada, which I do not particularly like either... This goes against a lot of my personal ethics, but then it beats the obscurity of pure destruction of important scientific, cultural and historical data.

Miscellaneous

Considering the workload involved in the above craziness, the fact that I worked on fewer projects than my usual madness shouldn't come as a surprise.

  • As part of the calendar work, I wrote a new tool called moonphases which shows a list of moon phase events in the given time period, and shipped that as part of undertime 1.5 for lack of a better place.

  • AlternC revival: friends at Koumbit asked me for source code of AlternC projects I was working on. I was disappointed (but not surprised) that upstream simply took those repositories down without publishing an archive. Thankfully, I still had SVN checkouts but unfortunately, those do not have the full history, so I reconstructed repositories based on the last checkout that I had for alternc-mergelog, alternc-stats, and alternc-slavedns.

  • I packaged two new projects into Debian, bitlbee-mastodon (to connect to the new Mastodon network over IRC) and python-internetarchive (a command line interface to the IA upload forms)

  • my work on archival tools led to a moderately important patch in pywb: allowing files to be symlinked and hardlinked instead of just copied was important to manage multiple large WARC files along with git-annex.

  • I also noticed the IA people were using a tool called slurm to diagnose bandwidth problems on their networks and implemented interface speed detection on Linux while I was there. slurm is interesting, but I also found out about bmon through the hilarious hollywood project. Each has its advantages: bmon has packets-per-second graphs, while slurm only has bandwidth graphs but also records maximal burst speeds, which is very useful (a quick invocation example for both follows below).
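
For reference, this is roughly how I run both tools; a minimal sketch, assuming the interface is called eth0 (adjust to your setup):

    # slurm monitors a single interface passed with -i
    slurm -i eth0

    # bmon can watch several interfaces; -p filters which ones to show
    bmon -p eth0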

Debian Long Term Support (LTS)

This is my monthly Debian LTS report. Note that my previous report wasn't published on this blog but on the mailing list.

Enigmail / GnuPG 2.1 backport

I've spent a significant amount of time working on the Enigmail backport for a third consecutive month. I first published a straightforward backport of GnuPG 2.1 depending on the libraries available in jessie-backports last month, but then I actually rebuilt the dependencies as well and sent a "HEADS UP" to the mailing list, which finally got people's attention.

There are many changes bundled in that possible update: GnuPG actually depends on about half a dozen other libraries, mostly specific to GnuPG, but in some cases used by third-party software as well. The most problematic one is libgcrypt20, which Emilio Pozuelo Monfort said included tens of thousands of lines of changes. So even though I tested the change against cryptsetup, gpgme, libotr, mutt and Enigmail itself, there are concerns that other dependencies merit more testing as well.

This caused many to raise the idea of aborting the work and simply marking Enigmail as unsupported in jessie. But Daniel Kahn Gillmor suggested this should also imply removing Thunderbird itself from jessie, as simply removing Enigmail will force people to use the binaries from Mozilla's add-ons service. Gillmor explained those builds include an OpenPGP.js implementation of dubious origin, which is especially problematic considering it deals with sensitive private key material.

It's unclear which way this will go next. I'm taking a break from this issue and hope others will be able to test the packages. If we keep working on Enigmail, the next steps will be to re-enable the dbg packages that were removed in the stretch updates, use dh-autoreconf correctly, remove some mingw packages I forgot, and test gcrypt like crazy (especially the 1.7 update). We'd also update to the latest Enigmail, as it fixes issues that forced the Tails project to disable Autocrypt because of weird interactions that make it send cleartext (instead of encrypted) mail in some cases.

Automatic unclaimer

My previous report yielded an interesting discussion around my work on the security tracker, specifically the "automatic unclaimer" designed to unassign issues that have been idle for too long. Holger Levsen, with his new coordinator hat, tested the program and found many bugs and missing features, which I was happy to implement. After many patches and much back and forth, it seems the program is working well, although it's run by hand by the coordinator.

DLA website publication

I took a look at various issues surrounding the publication of LTS advisories on the main debian.org website. While normal security advisories are regularly published on debian.org/security, more than 500 DLAs are missing from the website, mainly because DLAs are not automatically imported.

As it turns out, there is a script called parse-dla.pl that is designed to handle those entries but for some reason, they are not imported anymore. So I got to work to import the backlog and make sure new entries are properly imported.

Various fixes to parse-dla.pl were necessary to properly parse messages both from the templates generated by gen-DLA and from the existing archives. Then I tested the result with two existing advisories, which resulted in two merge requests (MRs) on the webwml repo: add data for DLA-1561 and add dla-1580 advisory. I requested and was granted access to the repo, and eventually merged my own MRs after a review from Levsen.

I eventually used the following procedure to test importing the entire archive:

    rsync -vPa master.debian.org:/home/debian/lists/debian-lts-announce .
    cd debian-lts-announce
    xz -d *.xz
    cat * > ../giant.mbox
    mbox2maildir ../giant.mbox debian-lts-announce.d
    for mail in debian-lts-announce.d/cur/*; do
        ~/src/security-tracker/./parse-dla.pl $mail
    done

This led to 82 errors on an empty directory, which is not bad at all considering the amount of data processed. Of course, there were many more errors in the live directory, as many advisories were already present. In the live directory, this resulted in 2431 new advisories being added to the website.

There were a few corner cases:

  • The first month or so didn't use DLA identifiers and many of those were not correctly imported even back then.

  • DLA-574-1 was a duplicate, covered by the DLA-574-2 regression update. But I only found the Imagemagick advisory - it looks like the qemu one was never published.

  • Similarly, the graphite2 regression was never assigned a real identifier.

  • Other cases include, for example, DLA-787-1 which was sent twice and the DLA-1263-1 duplicate, which was irrecoverable as it was never added to data/DLA/list.

Those special cases will all need to be handled by an eventual automation of this process, which I still haven't quite figured out. Maybe a process similar to the unclaimer will be followed: the coordinator or I could add missing DLAs until we streamline the process, as it seems unlikely we will want to add more friction to the DLA release by forcing workers to send merge requests to the web team, as that will only put more pressure on the web team...

There are also nine advisories missing from the mailing list archive because of a problem with the mailing list server at that time. We'll need to extract those from people's email archives, which I am not sure how to coordinate at this point.

PHP CVE identifier confusion

I have investigated CVE-2018-19518, mistakenly identified as CVE-2018-19158 in various places, including upstream's bugtracker. I requested the latter erroneous CVE-2018-19158 to be retired to avoid any future confusion. Unfortunately, Mitre indicated the CVE was already in "active use for pre-disclosure vulnerability coordination", which made it impossible to correct the error at that level.

I've instead asked upstream to correct the metadata in their tracker but it seems nothing has changed there yet.


Large files with Git: LFS and git-annex

Mon, 12/10/2018 - 19:00

Git does not handle large files very well. While there is work underway to handle large repositories through the commit graph work, Git's internal design has remained surprisingly constant throughout its history, which means that storing large files into Git comes with a significant and, ultimately, prohibitive performance cost. Thankfully, other projects are helping Git address this challenge. This article compares how Git LFS and git-annex address this problem and should help readers pick the right solution for their needs.

The problem with large files

As readers probably know, Linus Torvalds wrote Git to manage the history of the kernel source code, which is a large collection of small files. Every file is a "blob" in Git's object store, addressed by its cryptographic hash. A new version of that file will store a new blob in Git's history, with no deduplication between the two versions. The pack file format can store binary deltas between similar objects, but if many objects of similar size change in a repository, that algorithm might fail to properly deduplicate. In practice, large binary files (say JPEG images) have an irritating tendency of changing completely when even the smallest change is made, which makes delta compression useless.
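
To see that duplication at work, one can commit two slightly different versions of a binary file and look at the object store afterwards; this is a minimal sketch with made-up file names:

    git init blob-demo && cd blob-demo
    # create a 10MB binary file and commit it
    dd if=/dev/urandom of=photo.jpg bs=1M count=10
    git add photo.jpg && git commit -m 'first version'
    # change a single byte and commit again
    printf 'x' | dd of=photo.jpg bs=1 count=1 conv=notrunc
    git add photo.jpg && git commit -m 'second version'
    # the object store now holds two full ~10MB blobs
    git count-objects -vH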

There have been different attempts at fixing this in the past. In 2006, Torvalds worked on improving the pack-file format to reduce object duplication between the index and the pack files. Those changes were eventually reverted because, as Nicolas Pitre put it: "that extra loose object format doesn't appear to be worth it anymore".

Then in 2009, Caca Labs worked on improving the fast-import and pack-objects Git commands to do special handling for big files, in an effort called git-bigfiles. Some of those changes eventually made it into Git: for example, since 1.7.6, Git will stream large files directly to a pack file instead of holding them all in memory. But files are still kept forever in the history.

An example of trouble I had to deal with is the Debian security tracker, which follows all security issues in the entire Debian history in a single file. That file is around 360,000 lines for a whopping 18MB. The resulting repository takes 1.6GB of disk space and a local clone takes 21 minutes to perform, mostly taken up by Git resolving deltas. Commit, push, and pull are noticeably slower than in a regular repository, taking anywhere from a few seconds to a minute depending on how old the local copy is. And running annotate on that large file can take up to ten minutes. So even though that is a simple text file, it has grown large enough to cause significant problems for Git, which is otherwise known for stellar performance.

Intuitively, the problem is that Git needs to copy files into its object store to track them. Third-party projects therefore typically solve the large-files problem by taking files out of Git. In 2009, Git evangelist Scott Chacon released GitMedia, which is a Git filter that simply takes large files out of Git. Unfortunately, there hasn't been an official release since then and it's unclear if the project is still maintained. The next effort to come up was git-fat, first released in 2012 and still maintained. But neither tool has seen massive adoption yet. If I had to venture a guess, it might be because both require manual configuration. Both also require a custom server (rsync for git-fat; S3, SCP, Atmos, or WebDAV for GitMedia) which limits collaboration since users need access to another service.

Git LFS

That was before GitHub released Git Large File Storage (LFS) in August 2015. Like all software taking files out of Git, LFS tracks file hashes instead of file contents. So instead of adding large files into Git directly, LFS adds a pointer file to the Git repository, which looks like this:

    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 12345

LFS then uses Git's smudge and clean filters to show the real file on checkout. Git only stores that small text file and does so efficiently. The downside, of course, is that large files are not version controlled: only the latest version of a file is kept in the repository.

Git LFS can be used in any repository by installing the right hooks with git lfs install then asking LFS to track any given file with git lfs track. This will add the file to the .gitattributes file which will make Git run the proper LFS filters. It's also possible to add patterns to the .gitattributes file, of course. For example, this will make sure Git LFS will track MP3 and ZIP files:

    $ cat .gitattributes
    *.mp3 filter=lfs -text
    *.zip filter=lfs -text

After this configuration, we use Git normally: git add, git commit, and so on will talk to Git LFS transparently.
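
Putting those pieces together, a typical first session looks roughly like this; the file name is made up:

    # one-time setup of the LFS hooks in this repository
    git lfs install
    # route all MP3 files through LFS
    git lfs track '*.mp3'
    # the tracking rule lives in .gitattributes and must be committed too
    git add .gitattributes album.mp3
    git commit -m 'add album through LFS'
    # on push, the MP3 content goes to the LFS server, not the Git remote
    git push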

The actual files tracked by LFS are copied to a path like .git/lfs/objects/{OID-PATH}, where {OID-PATH} is a sharded file path of the form OID[0:2]/OID[2:4]/OID and where OID is the content hash (currently SHA-256) of the file. This brings the extra feature that multiple copies of the same file in the same repository are automatically deduplicated, although in practice this rarely occurs.
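
As a small sketch of that sharding, with a hypothetical file name, the storage path can be derived from the hash like this:

    # compute the SHA-256 OID of a file and the sharded path LFS uses for it
    oid=$(sha256sum big-file.bin | cut -d' ' -f1)
    echo ".git/lfs/objects/${oid:0:2}/${oid:2:2}/${oid}"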

Git LFS will copy large files to that internal storage on git add. When a file is modified in the repository, Git notices, the new version is copied to the internal storage, and the pointer file is updated. The old version is left dangling until the repository is pruned.

This process only works for new files you are importing into Git, however. If a Git repository already has large files in its history, LFS can fortunately "fix" repositories by retroactively rewriting history with git lfs migrate. This has all the normal downsides of rewriting history, however --- existing clones will have to be reset to benefit from the cleanup.
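
A migration of existing history could look like the following; the patterns are only examples and, as noted above, this rewrites history:

    # rewrite all branches and tags so ZIP and MP3 files go through LFS
    git lfs migrate import --everything --include='*.zip,*.mp3'
    # existing clones must then be re-cloned or hard reset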

LFS also supports file locking, which allows users to claim a lock on a file, making it read-only everywhere except in the locking repository. This allows users to signal others that they are working on an LFS file. Those locks are purely advisory, however, as users can remove other users' locks by using the --force flag. LFS can also prune old or unreferenced files.
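
The locking and pruning commands are straightforward; the file name here is made up:

    # claim an advisory lock, making the file read-only for others
    git lfs lock design/logo.psd
    # list current locks
    git lfs locks
    # release the lock (--force can steal someone else's lock)
    git lfs unlock design/logo.psd
    # delete old, unreferenced local copies of LFS files
    git lfs prune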

The main limitation of LFS is that it's bound to a single upstream: large files are usually stored in the same location as the central Git repository. If it is hosted on GitHub, this means a default quota of 1GB storage and bandwidth, but you can purchase additional "packs" to expand both of those quotas. GitHub also limits the size of individual files to 2GB. This upset some users surprised by the bandwidth fees, which were previously hidden in GitHub's cost structure.

While the actual server-side implementation used by GitHub is closed source, there is a test server provided as an example implementation. Other Git hosting platforms have also implemented support for the LFS API, including GitLab, Gitea, and BitBucket; that level of adoption is something that git-fat and GitMedia never achieved. LFS does support hosting large files on a server other than the central one --- a project could run its own LFS server, for example --- but this will involve a different set of credentials, bringing back the difficult user onboarding that affected git-fat and GitMedia.

Another limitation is that LFS only supports pushing and pulling files over HTTP(S) --- no SSH transfers. LFS uses some tricks to bypass HTTP basic authentication, fortunately. This also might change in the future as there are proposals to add SSH support, resumable uploads through the tus.io protocol, and other custom transfer protocols.

Finally, LFS can be slow. Every file added to LFS takes up double the space on the local filesystem as it is copied to the .git/lfs/objects storage. The smudge/clean interface is also slow: it works as a pipe, but buffers the file contents in memory each time, which can be prohibitive with files larger than available memory.

git-annex

The other main player in large file support for Git is git-annex. We covered the project back in 2010, shortly after its first release, but it's certainly worth discussing what has changed in the eight years since Joey Hess launched the project.

Like Git LFS, git-annex takes large files out of Git's history. The way it handles this is by storing a symbolic link to the file in .git/annex. We should probably credit Hess for this innovation, since the Git LFS storage layout is obviously inspired by git-annex. The original design of git-annex introduced all sorts of problems however, especially on filesystems lacking symbolic-link support. So Hess has implemented different solutions to this problem. Originally, when git-annex detected such a "crippled" filesystem, it switched to direct mode, which kept files directly in the work tree, while internally committing the symbolic links into the Git repository. This design turned out to be a little confusing to users, including myself; I have managed to shoot myself in the foot more than once using this system.

Since then, git-annex has adopted a different v7 mode that is also based on smudge/clean filters, which it calls "unlocked files". Like Git LFS, unlocked files will double disk space usage by default. However, it is possible to reduce disk space usage by using "thin mode", which uses hard links between the internal git-annex disk storage and the work tree. The downside is, of course, that changes are immediately performed on files, which means previous file versions are automatically discarded. This can lead to data loss if users are not careful.
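
Roughly, the v7 "unlocked" workflow looks like this; a sketch only, where annex.thin is the setting that enables the hard-link behavior (and its data-loss caveat) described above:

    # upgrade the repository to the newer (v7) layout
    git annex upgrade
    # use hard links between the annex and the work tree to save space
    git config annex.thin true
    # make a file editable in place ("unlocked")
    git annex unlock video.mp4
    # record the new version
    git annex add video.mp4
    git commit -m 'update video'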

Furthermore, git-annex in v7 mode suffers from some of the performance problems affecting Git LFS, because both use the smudge/clean filters. Hess actually has ideas on how the smudge/clean interface could be improved. He proposes changing Git so that it stops buffering entire files into memory, allows filters to access the work tree directly, and adds the hooks he found missing (for stash, reset, and cherry-pick). Git-annex already implements some tricks to work around those problems itself but it would be better for those to be implemented in Git natively.

Being more distributed by design, git-annex does not have the same "locking" semantics as LFS. Locking a file in git-annex means protecting it from changes, so files need to actually be in the "unlocked" state to be editable, which might be counter-intuitive to new users. In general, git-annex has some of those unusual quirks and interfaces that often come with more powerful software.

And git-annex is much more powerful: it not only addresses the "large-files problem" but goes much further. For example, it supports "partial checkouts" --- downloading only some of the large files. I find that especially useful to manage my video, music, and photo collections, as those are too large to fit on my mobile devices. Git-annex also has support for location tracking, where it knows how many copies of a file exist and where, which is useful for archival purposes. And while Git LFS is only starting to look at transfer protocols other than HTTP, git-annex already supports a large number through a special remote protocol that is fairly easy to implement.
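
Here is the kind of day-to-day usage that makes partial checkouts and location tracking so useful; the paths and remote are, of course, examples:

    # clone only the metadata; large file contents are not downloaded yet
    git clone ssh://server/music.git && cd music
    git annex init "laptop"
    # fetch the content of a single file on demand
    git annex get album/track01.flac
    # see how many copies exist and where they live
    git annex whereis album/track01.flac
    # free local space while keeping the content on other remotes
    git annex drop album/track01.flac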

"Large files" is therefore only scratching the surface of what git-annex can do: I have used it to build an archival system for remote native communities in northern Québec, while others have built a similar system in Brazil. It's also used by the scientific community in projects like GIN and DataLad, which manage terabytes of data. Another example is the Japanese American Legacy Project which manages "upwards of 100 terabytes of collections, transporting them from small cultural heritage sites on USB drives".

Unfortunately, git-annex is not well supported by hosting providers. GitLab used to support it, but since it implemented Git LFS, it dropped support for git-annex, saying it was a "burden to support". Fortunately, thanks to git-annex's flexibility, it may eventually be possible to treat LFS servers as just another remote which would make git-annex capable of storing files on those servers again.

Conclusion

Git LFS and git-annex are both mature and well maintained programs that deal efficiently with large files in Git. LFS is easier to use and is well supported by major Git hosting providers, but it's less flexible than git-annex.

Git-annex, in comparison, allows you to store your content anywhere and espouses Git's distributed nature more faithfully. It also uses all sorts of tricks to save disk space and improve performance, so it should generally be faster than Git LFS. Learning git-annex, however, feels like learning Git: you always feel you are not quite there and you can always learn more. It's a double-edged sword and can feel empowering for some users and terrifyingly hard for others. Where you stand on the "power-user" scale, along with project-specific requirements will ultimately determine which solution is the right one for you.

Ironically, after thorough evaluation of large-file solutions for the Debian security tracker, I ended up proposing to rewrite history and split the file by year which improved all performance markers by at least an order of magnitude. As it turns out, keeping history is critical for the security team so any solution that moves large files outside of the Git repository is not acceptable to them. Therefore, before adding large files into Git, you might want to think about organizing your content correctly first. But if large files are unavoidable, the Git LFS and git-annex projects allow users to keep using most of their current workflow.

This article first appeared in the Linux Weekly News.


October 2018 report: LTS, Monkeysphere, Flatpak, Kubernetes, CD archival and calendar project

Thu, 11/01/2018 - 15:12
Debian Long Term Support (LTS)

This is my monthly Debian LTS report.

GnuTLS

As discussed last month, one of the options to resolve the pending GnuTLS security issues was to backport the latest 3.3.x series (3.3.30), an update I proposed and then uploaded as DLA-1560-1. Following a suggestion, I've included an explicit NEWS.Debian item warning people about the upgrade, a warning also included in the advisory itself.

The most important change is probably dropping SSLv3, RC4, HMAC-SHA384 and HMAC-SHA256 from the list of algorithms, which could impact interoperability. Considering how old RC4 and SSLv3 are, however, this should be a welcome change. As for the HMAC changes, those are mandatory to fix the targeted vulnerabilities (CVE-2018-10844, CVE-2018-10845, CVE-2018-10846).

Xen

Xen updates had been idle for a while in LTS, so I bit the bullet and did a first pass over the pending vulnerabilities. I sent the result to the folks over at Credativ who maintain the 4.4 branch and they came back with a set of proposed updates, which I briefly reviewed. Unfortunately, the patches were too deep for me: all I was able to do was to confirm consistency with the upstream patches.

I also brought up a discussion regarding the viability of Xen in LTS, especially regarding the "speculative execution" vulnerabilities (XSA-254 and related). My understanding was that upstream Xen fixes are not (yet?) complete, but apparently that is incorrect, as Peter Dreuw is "confident in the Xen project to provide a solution for these issues". I nevertheless consider, like Red Hat, that the simpler KVM implementation might provide more adequate protection against those kinds of attacks, and LTS users should seriously consider switching to KVM for hosting untrusted virtual machines, even if only because that code is actually mainline in the kernel while Xen is unlikely to ever be. It might be, as Dreuw said, simpler to upgrade to stretch than to switch virtualization systems...

When all is said and done, however, Linux and KVM are patched in jessie at the time of writing, while Xen is not (yet).

Enigmail

I spent a significant amount of time working on Enigmail this month again, this time specifically working on reviewing the stretch proposed update to gnupg from Daniel Kahn Gillmor (dkg). I did not publicly share the code review as we were concerned it would block the stable update, which seemed to be in jeopardy when I started working on the issue. Thankfully, the update went through but it means it might impose extra work on leaf packages. Monkeysphere, in particular, might fail to build from source (FTBFS) after the gnupg update lands.

In my tests, however, it seems that packages using GPG can deal with the update correctly. I tested Monkeysphere, Password Store, git-remote-gcrypt and Enigmail, all of which passed a summary smoke test. I have tried to summarize my findings on the mailing list. Basically our options for the LTS update are:

  1. pretend Enigmail works without changing GnuPG, possibly introducing security issues

  2. ship a backport of GnuPG and Enigmail through jessie-sloppy-backports

  3. package OpenPGP.js and backport all the way down to jessie

  4. remove Enigmail from jessie

  5. backport the required GnuPG patchset from stretch to jessie

So far I've taken that last step as my favorite approach...

Firefox / Thunderbird and finding work

... which brings us to the Firefox and Thunderbird updates. I was assuming those were going ahead, but the status of those updates currently seems unclear. This is a symptom of a larger problem in the LTS work organization: some packages can stay "claimed" for a long time without an obvious status update.

We discussed ways of improving on this process and, basically, I will try to be more proactive in taking over packages from others and reaching out to others to see if they need help.

A note on GnuPG

As an aside to the Enigmail / GnuPG review, I was struck by the ... peculiarities in the GnuPG code during my review. I discovered that GnuPG, instead of using the standard resolver, implements its own internal full-stack DNS server, complete with UDP packet parsing. That's 12 000 lines of code right there. There are also abstraction leaks like using "1" and "0" as boolean values inside functions (as opposed to passing an integer and converting as string on output).

A major change in the proposed patchset concerns the --with-colons batch output, which GnuPG consumers (like GPGME) are supposed to use to interoperate with GnuPG. Having written such a parser myself, I can testify to how difficult parsing those data structures is. Normally, you should always be using GPGME instead of parsing that output directly, but unfortunately GPGME does not do everything GPG does: signing operations and keyring management, for example, have long been considered out of scope, so users are forced to parse that output.

Long story short, GPG consumers still use --with-colons directly (and that includes Enigmail) because they have to. In this case, critical components were missing from that output (e.g. knowing which key signed which UID) so they were added in the patch. That's what breaks the Monkeysphere test suite, which doesn't expect a specific field to be present. Later versions of the protocol specification have been updated (by dkg) to clarify that might happen, but obviously some have missed the notice, as it came a bit late.

In any case, the review did not make me confident in the software architecture or implementation of the GnuPG program.

autopkgtest testing

As part of our LTS work, we often run tests to make sure everything is in order. Starting with jessie, we are now seeing packages with autopkgtest enabled, so I started meddling with that program. One of the ideas I was hoping to implement was to unify my virtualization systems; right now I'm juggling schroot (for sbuild), QEMU/KVM images, and autopkgtest.

Because sbuild can talk with autopkgtest, and autopkgtest can talk with qemu (which can use KVM images), I figured I could get rid of schroot. Unfortunately, I met a few snags:

  • #911977: how do we correctly guess the VM name in autopkgtest?
  • #911963: qemu build fails with proxy_cmd: parameter not set (fixed and provided a patch)
  • #911979: fails on chown in autopkgtest-qemu backend
  • #911981: qemu server warns about missing CPU features

So I gave up on that approach. But I did get autopkgtest working and documented the process in my quick Debian development guide.

Oh, and I also got sucked into a wiki stylesheet issue (#864925) after battling with the SystemBuildTools page.

Spamassassin followup

Last month I agreed we could backport the latest upstream version of SpamAssassin (a recurring pattern). After getting the go-ahead from the maintainer, I got a test package uploaded, but the actual upload will need to wait for the stretch update (#912198) to land to avoid a versioning conflict.

Salt Stack

My first impression of Salt was not exactly impressive. The CVE-2017-7893 issue was rather unclear: first upstream fixed the issue, but reverted the default flag which would enable signature forging after it was discovered this would break compatibility with older clients.

But even worse, the 2014 version of Salt shipped in jessie did not have master signing in the first place, which means there was simply no way to protect against master impersonation, a worrisome concept. But I assumed this was expected behavior, triaged this away from jessie, and tried to forget about the horrors I had seen.

phpLDAPadmin with sunweaver

I looked next at the phpLDAPadmin (or PHPLDAPadmin?) vulnerabilities, but could not reproduce the issue using the provided proof of concept. I have also audited the code and it seems pretty clear the code is protected against such an attack, as was explained by another DD in #902186. So I asked Mitre for rejection, and uploaded DLA-1561-1 to fix the other issue (CVE-2017-11107). Meanwhile the original security researcher acknowledged the security issue was a "false positive", although only in a private email.

I almost did an NMU for the package but the security team requested that I wait, and marked the package as grave so it gets kicked out of buster instead. I at least submitted the patch, originally provided by Ubuntu folks, upstream.

Smarty3

Finally, I worked on the smarty3 package. I confirmed the package in jessie is not vulnerable, because Smarty hadn't yet had the brilliant idea of "optimizing" realpath by rewriting it with new security vulnerabilities. Indeed, the CVE-2018-13982 proof of concept and the CVE-2018-16831 proof of concept both fail in jessie.

I tried to audit the patch shipped with stretch to make sure it fixed the security issue in question (without introducing new ones, of course), but abandoned parsing the stretch patch because this regex gave me a headache:

    '%^(?<root>(?:[[:alpha:]]:[\\\\]|/|[\\\\]{2}[[:alpha:]]+|[[:print:]]{2,}:[/]{2}|[\\\\])?)(?<path>(?:[[:print:]]*))$%u'

"who is supporting our users?"

I finally participated in a discussion regarding concerns about support of cloud images for LTS releases. I proposed that, like other parts of Debian, responsibility of those images would shift to the LTS team when official support is complete. Cloud images fall in that weird space (ie. "Installing Debian") which is not traditionally covered by the LTS team.

Hopefully that will become the policy, but only time will tell how this will play out.

Other free software work

irssi sandbox

I had been uncomfortable running irssi as my main user on my server for a while. It's a constantly running network server, sometimes connecting to shady servers too. So it made sense to run this as a separate user and, while I'm there, start it automatically on boot.

I created the following file in /etc/systemd/system/irssi@.service, based on this gist:

    [Unit]
    Description=IRC screen session
    After=network.target

    [Service]
    Type=forking
    User=%i
    ExecStart=/usr/bin/screen -dmS irssi irssi
    ExecStop=/usr/bin/screen -S irssi -X stuff '/quit\n'
    NoNewPrivileges=true

    [Install]
    WantedBy=multi-user.target

A whole apparmor/selinux/systemd profile could be written for irssi, of course, but I figured I would start with NoNewPrivileges. Unfortunately, that line breaks screen, which is sgid utmp -- some sort of "new privilege" -- so I'm running this as a vanilla service. To enable it, simply enable the service with the right username, previously created with adduser:

    systemctl enable irssi@foo.service
    systemctl start irssi@foo.service

Then I join the session by logging in as the foo user, which can be configured in .ssh/config as a convenience host:

    Host irc.anarc.at
        Hostname shell.anarc.at
        User foo
        IdentityFile ~/.ssh/id_ed25519_irc
        # using command= in authorized_keys until we're all on buster
        #RemoteCommand screen -x
        RequestTTY force

Then the ssh irc.anarc.at command rejoins the screen session.

Monkeysphere revival

Monkeysphere was in bad shape in Debian buster. The bit-rotten test suite was failing and the package was about to be removed from the next Debian release. I filed and worked on many critical bugs (Debian bug #909700, Debian bug #908228, Debian bug #902367, Debian bug #902320, Debian bug #902318, Debian bug #899060, Debian bug #883015), but the final fix came from another user. I was also welcomed onto the Debian packaging team, which should allow me to make a new release the next time we have similar issues, something that was a blocker this time around.

Unfortunately, I had to abandon the Monkeysphere FreeBSD port. I had simply forgotten about that commitment and, since I do not run FreeBSD anywhere anymore, it made little sense to keep on doing so, especially since most of the recent updates were done by others anyways.

Calendar project

I've been working on a photography project since the beginning of the year. Each month, I pick the best picture out of my various shoots and will collect those in a 2019 calendar. I documented my work in the photo page, but most of my work in October was around finding a proper tool to layout the calendar itself. I settled on wallcalendar, a beautiful LaTeX template, because the author was very responsive to my feature request.

I also figured out which events to include in the calendar and a way to generate moon phases (now part of the undertime package) for the local timezone. I still have to figure out which other astronomical events to include. I had no response from the local planetarium but (as always) good feedback from NASA folks, who pointed me at useful resources to top up the calendar.

Kubernetes

I got deeper into Kubernetes work by helping friends setup a cluster and share knowledge on how to setup and manage the platforms. This led me to fix a bug in Kubespray, the install / upgrade tool we're using to manage Kubernetes. To get the pull request accepted, I had to go through the insanely byzantine CLA process of the CNCF, which was incredibly frustrating, especially since it was basically a one-line change. I also provided a code review of the Nextcloud helm chart and reviewed the python-hvac ITP, one of the dependencies of Kubespray.

As I get more familiar with Kubernetes, it does seem like it can solve real problems especially for shared hosting providers. I do still feel it's overly complex and over-engineered. It's difficult to learn and moving too fast, but Docker and containers are such a convenient way to standardize shipping applications that it's hard to deny this new trend does solve a problem that we have to fix right now.

CD archival

As part of my work on archiving my CD collection, I contributed three pull requests to fix issues I was having with the project, mostly regarding corner cases but also improvements on the Dockerfile. At my suggestion, upstream also enabled automatic builds for the Docker image which should make it easier to install and deploy.

I still wish to write an article on this, to continue my series on archives, which could happen in November if I can find the time...

Flatpak conversion

After reading a convincing benchmark I decided to give Flatpak another try and ended up converting all my Snap packages to Flatpak.

Flatpak has many advantages:

  • it's decentralized: like APT or F-Droid repositories, anyone can host their own (there is only one Snap repository, managed by Canonical)

  • it's faster: the above benchmarks hinted at this, but I could also confirm Signal starts and runs faster under Flatpak than Snap

  • it's standardizing: much of the work Flatpak is doing to figure out how to containerize desktop applications is being standardized (and even adopted by Snap)

Much of this was spurred by the breakage of Zotero in Debian (Debian bug #864827) due to the Firefox upgrade. I made a wiki page to tell our users how to install Zotero in Debian considering Zotero might take a while to be packaged back in Debian (Debian bug #871502).

Debian work

Without my LTS hat, I worked on the following packages:

Other work

Usual miscellaneous:


Archived a part of my CD collection

Thu, 10/11/2018 - 10:52

After about three days of work, I've finished archiving a part of my old CD collection. There were about 200 CDs in a cardboard box that were gathering dust. After reading Jonathan Dowland's post about CD archival, I got (rightly) worried it would be damaged beyond rescue so I sat down and did some research on the rescue mechanisms. My notes are in rescue and I hope to turn this into a more agreeable LWN article eventually.

I post this here so I can put a note in the box with a permanent URL for future reference as well.

Remaining work

All the archives created were dumped in the ~/archive or ~/mp3 directories on curie. Data needs to be deduplicated, replicated, and archived somewhere more logical.

Inventory

I have a bunch of piles:

  • a spindle of disks that consists mostly of TV episodes, movies, distro and Windows images/ghosts. not imported.
  • a pile of tapes and Zip drives. not imported.
  • about forty backup disks. not imported.
  • about five "books" disks of various sorts. ISOs generated. partly integrated in my collection, others failed to import or were in formats that were considered non-recoverable
  • a bunch of orange seeds piles
    • Burn Your TV masters and copies
    • apparently live and unique samples - mostly imported in mp3
    • really old stuff with tons of dupes - partly sorted through, in jams4, the rest still in the pile
  • a pile of unidentified disks

All disks were eventually identified as trash, blanks, perfect, finished, defective, or not processed. A special "needs attention" stack was the "to do" pile, which would get sorted into the other piles. Each pile was labeled with a sticky note and taped together summarily.

A post-it pointing to the blog post was included in the box, along with a printed version of the blog post summarizing a snapshot of this inventory.

Here is a summary of what's in the box.

    Type            Count   Note
    trash           13      non-recoverable. not detected by the Linux kernel at all and no further attempt has been made to recover them.
    blanks          3       never written to, still usable
    perfect         28      successfully archived, without errors
    finished        4       almost perfect: but mixed-mode or multi-session
    defective       21      found to have errors but not considered important enough to re-process
    total           69
    not processed   ~100    visual estimate

September 2018 report: LTS, Mastodon, Firefox privacy, etc

Mon, 10/01/2018 - 15:28
Debian Long Term Support (LTS)

This is my monthly Debian LTS report.

Python updates

Uploaded DLA-1519-1 and DLA-1520-1 to fix CVE-2018-1000802, CVE-2017-1000158, CVE-2018-1061 and CVE-2018-1060 in Python 2.7 and 3.4. The latter three were originally marked as no-dsa but the fix was trivial to backport. I also found that CVE-2017-1000158 was actually relevant for 3.4 even though it was not marked as such in the tracker.

CVE-2018-1000030 was skipped because the fix was too intrusive and unclear.

Enigmail investigations

Security support for Thunderbird and Firefox versions from jessie has stopped upstream. Considering that the Debian security team bit the bullet and updated those in stretch, the consensus seems to be that the versions in jessie will also be updated, which will break third-party extensions in jessie.

One of the main victims of the XULocalypse is Enigmail, which completely stopped working after the stretch update. I looked at how we could handle this. I first proposed to wait before trying to patch the Enigmail version in jessie, since it would break when the Thunderbird updates land. I then detailed five options for the Enigmail security update:

  1. update GnuPG 2 in jessie-security to work with Enigmail, which could break unrelated things

  2. same as 1, but in jessie-backports-sloppy

  3. package the JavaScript dependencies to ship Enigmail with OpenPGP.js correctly.

  4. remove Enigmail from jessie

  5. backport only some patches to GPG 2 in jessie

I then looked at helping the Enigmail maintainers by reviewing the OpenPGP.js packaging, through which I found a bug in the JavaScript packaging toolchain, which diverged into a patch in npm2deb to fix source package detection and an Emacs function to write to multiple files. (!!) That work was not directly useful to jessie, I must admit, but it did end up clarifying which dependencies were missing for OpenPGP.js to land, which were clearly out of reach of an LTS update.

Switching gears, I tried to help the maintainer untangle the JavaScript mess between multiple copies of code in TB, FF (with itself), and Enigmail's process handling routines; to call GPG properly with multiple file descriptors for password, clear-text, statusfd, and output; to have Autocrypt be able to handle "Autocrypt Setup Messages" (ASM) properly (bug #908510); to finally make the test suite pass. The alternative here would be to simply rip Autocrypt out of Enigmail for the jessie update, but this would mean diverging significantly from the upstream version.

Reports of Enigmail working with older versions of GPG are deceiving, as that configuration introduces unrelated security issues (T4017 and T4018 in upstream's bugtracker).

So much more work remains on backporting Enigmail, but I might wait for the stable/unstable updates to complete before pushing that work further. Instead, I might focus on the Thunderbird and Firefox updates next.

GnuTLS

I worked more on the GnuTLS research as a short followup to our previous discussion.

I wrote to the researchers, who "still stand behind what is written in the paper" and believe the current fix in GnuTLS is incomplete. GnuTLS upstream seems to agree, more or less, but points out that the fix, even if incomplete, greatly reduces the scope of those vulnerabilities and a long-term fix is underway.

Next step, therefore, is deciding if we backport the patches or just upgrade to the latest 3.3.x series, as the ABI/API changes are minor (only additions).

Other work
  • completed the work on gdm3 and git-annex by uploading DLA-1494-1 and DLA-1495-1

  • fixed Debian bug #908062 in devscripts to make dch generate proper version numbers since jessie was released

  • checked with the SpamAssassin maintainer regarding the LTS update and whether we just use 3.4.2 across all suites

  • reviewed and tested Hugo's work on 389-ds. That involved getting familiar with that "other" slapd server (apart from OpenLDAP) which I did not know about.

  • checked that kdepim doesn't load external content so it is not vulnerable to EFAIL by default. The proposed upstream patch changes the API so that work is postponed.

  • triaged the Xen security issues by severity

  • filed bugs about Docker security issues (CVE-2017-14992 and CVE-2018-10892)

Other free software work

I have, this month again, been quite spread out on many unrelated projects unfortunately.

Mastodon

I've played around with the latest attempt from the free software community to come up with a "federation" model to replace Twitter and other social networks, Mastodon. I've had an account for a while but I haven't talked about it much here yet.

My Mastodon account is linked with my Twitter account through some unofficial Twitter cross-posting app which more or less works. Another "app" I use is the toot client to connect my website with Mastodon through feed2exec.

And because all of this social networking stuff is just IRC 2.0, I read it all through my IRC client, thanks to Bitlbee and Mastodon is (thankfully) no exception. Unfortunately, there's a problem in my hosting provider's configuration which has made it impossible to read Mastodon status from Bitlbee for a while. I've created a test profile on the main Mastodon instance to double-check, and indeed, Bitlbee works fine there.

Before I figured that out, I tried upgrading the Bitlbee Mastodon bridge (for which I also filed an RFP) and found a regression had been introduced somewhere after 1.3.1. On the plus side, the feature request I filed to allow for custom visibility statuses from Bitlbee has been accepted, which means it's now possible to send "private" messages from Bitlbee.

Those messages, unfortunately, are not really private: they are visible to all followers, which, in the social networking world, means a lot of people. In my case, I had already accepted over a dozen followers before realizing how that worked, and I do not really know or trust most of those people. I still have 15 pending follow requests which I don't want to approve until there's a better solution, which would probably involve two levels of followers. There's at least one proposal to fix this already.

Another thing I'm concerned about with Mastodon is account migration: what happens if I'm unhappy with my current host? Or if I prefer to host it myself? My online identity is strongly tied with that hostname and there doesn't seem to be good mechanisms to support moving around Mastodon instances. OpenID had this concept of delegation where the real OpenID provider could be discovered and redirected, keeping a consistent identity. Mastodon's proposed solutions seem to aim at using redirections or at least informing users your account has moved which isn't as nice, but might be an acceptable long-term compromise.

Finally, it seems that Mastodon will likely end up in the same space as email with regards to abuse: we are already seeing block lists show up to deal with abusive servers, which is horribly reminiscent of the early days of spam fighting, when you could still keep track of such lists by hand (as opposed to using Bayesian filters or machine learning). Fundamentally, I'm worried about the viability of this ecosystem, just like I'm concerned about the amount of fake news, spam, and harassment that takes place on commercial platforms. One theory is that the only way to fix this is to enforce two-way sharing between followers, the approach taken by Manyverse and Scuttlebutt.

Only time will tell, I guess, but Mastodon does look like a promising platform, at least in terms of raw numbers of users...

The ultimate paste bin?

I've started switching to ptpb.pw as a pastebin. Besides the unfortunately cryptic name, it's a great tool: multiple pastes are deduplicated, large pastes are allowed, there is a (limited) server-side viewing mechanism (allowing for some multimedia), and so on. The only things missing are "burn after reading" (one-shot links) and client-side encryption, although the latter is planned.

I like the simplistic approach to the API that makes it easy to use from any client. I've submitted the above feature request and a trivial patch so far.
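
Uploading boils down to a single curl call against the form-based API; this is from memory and may not match the current API exactly:

    # paste a file; the response contains the paste URL
    curl -F c=@somefile.txt https://ptpb.pw
    # or paste from standard input
    echo "hello world" | curl -F c=@- https://ptpb.pw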

ELPA packaging work

I've done a few reviews and some sponsoring of Emacs Lisp packages ("ELPA") for Debian, mostly for packages I requested myself but which were so nicely made by Nicolas (elpa-markdown-toc, elpa-auto-dictionary). To better figure out which packages are missing, I wrote this script to parse the output from an ELPA archive and compare it with what is in Debian. This involved digging deep into the API of the Debian archive, which in turn was useful for the JavaScript work previously mentioned. The result is in the firefox page, which lists all the extensions I use and their equivalent in Debian.

I'm not very happy with the script: it's dirty, and I feel dirty. It seems to me this should be done on the fly, through some web service, and should support multiple languages. It seems we are constantly solving this problem for each ecosystem while the issues are similar...

Firefox privacy issues

I went down another rabbit hole after learning about Mozilla's plan to force more or less mandatory telemetry in future versions of Firefox. That got me thinking of how many such sniffers were in Firefox and I was in for a bad surprise. It took about a day to establish a (probably incomplete) list of settings necessary to disable all those trackers in a temporary profile starter, originally designed as a replacement for chromium --temp-profile but which turned out to be a study of Firefox's sins.

There are over a hundred about:config settings that need to be tweaked if someone wants to keep their privacy intact in Firefox. This is especially distressing because Mozilla prides itself on its stance on privacy. I've documented this in the Debian wiki as well.
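
As a very partial illustration, some of those settings can be pinned down in a user.js file in the profile directory; this is only a tiny, hand-picked sample, and $PROFILE_DIR is a placeholder for wherever the (temporary) profile lives:

    # disable a few of the telemetry-related settings in a profile
    {
        echo 'user_pref("toolkit.telemetry.enabled", false);'
        echo 'user_pref("toolkit.telemetry.unified", false);'
        echo 'user_pref("datareporting.healthreport.uploadEnabled", false);'
        echo 'user_pref("app.shield.optoutstudies.enabled", false);'
    } >> "$PROFILE_DIR/user.js"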

Ideally, there would be a one-shot toggle to disable all those things. Instead, Mozilla is forcing us to play "whack-a-mole" as they pop out another undocumented configuration item with every other release.

Other work

Archiving web sites

Mon, 09/24/2018 - 19:00

I recently took a deep dive into web site archival for friends who were worried about losing control over the hosting of their work online in the face of poor system administration or hostile removal. This makes web site archival an essential instrument in the toolbox of any system administrator. As it turns out, some sites are much harder to archive than others. This article goes through the process of archiving traditional web sites and shows how it falls short when confronted with the latest fashions in the single-page applications that are bloating the modern web.

Converting simple sites

The days of handcrafted HTML web sites are long gone. Now web sites are dynamic and built on the fly using the latest JavaScript, PHP, or Python framework. As a result, the sites are more fragile: a database crash, spurious upgrade, or unpatched vulnerability might lose data. In my previous life as a web developer, I had to come to terms with the idea that customers expect web sites to basically work forever. This expectation matches poorly with the "move fast and break things" attitude of web development. Working with the Drupal content-management system (CMS) was particularly challenging in that regard, as major upgrades deliberately break compatibility with third-party modules, which implies a costly upgrade process that clients could seldom afford. The solution was to archive those sites: take a living, dynamic web site and turn it into plain HTML files that any web server can serve forever. This process is useful for your own dynamic sites, but also for third-party sites that are outside of your control and that you might want to safeguard.

For simple or static sites, the venerable Wget program works well. The incantation to mirror a full web site, however, is byzantine:

$ nice wget --mirror --execute robots=off --no-verbose --convert-links \
      --backup-converted --page-requisites --adjust-extension \
      --base=./ --directory-prefix=./ --span-hosts \
      --domains=www.example.com,example.com http://www.example.com/

The above downloads the content of the web page, but also crawls everything within the specified domains. Before you run this against your favorite site, consider the impact such a crawl might have on the site. The above command line deliberately ignores robots.txt rules, as is now common practice for archivists, and hammers the website as fast as it can. Most crawlers have options to pause between hits and limit bandwidth usage to avoid overwhelming the target site.
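With Wget, for example, adding flags like the following to the command above spaces out requests and caps the bandwidth used:

--wait=1 --random-wait --limit-rate=200k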

The above command will also fetch "page requisites" like style sheets (CSS), images, and scripts. The downloaded page contents are modified so that links point to the local copy as well. Any web server can host the resulting file set, which results in a static copy of the original web site.
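To quickly check the result locally before putting it behind a real web server, something as simple as Python's built-in server does the job (assuming the mirror ended up in a www.example.com/ directory, as with the command above):

$ cd www.example.com/ && python3 -m http.server 8000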

That is, when things go well. Anyone who has ever worked with a computer knows that things seldom go according to plan; all sorts of things can make the procedure derail in interesting ways. For example, it was trendy for a while to have calendar blocks in web sites. A CMS would generate those on the fly and make crawlers go into an infinite loop trying to retrieve all of the pages. Crafty archivers can resort to regular expressions (e.g. Wget has a --reject-regex option) to ignore problematic resources. Another option, if the administration interface for the web site is accessible, is to disable calendars, login forms, comment forms, and other dynamic areas. Once the site becomes static, those will stop working anyway, so it makes sense to remove such clutter from the original site as well.
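For example, to skip the infinite calendar pages mentioned above, one might add something like the following to the Wget command; the exact pattern obviously depends on the site:

--reject-regex '(calendar|month=)'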

JavaScript doom

Unfortunately, some web sites are built with much more than pure HTML. In single-page sites, for example, the web browser builds the content itself by executing a small JavaScript program. A simple user agent like Wget will struggle to reconstruct a meaningful static copy of those sites as it does not support JavaScript at all. In theory, web sites should be using progressive enhancement to have content and functionality available without JavaScript but those directives are rarely followed, as anyone using plugins like NoScript or uMatrix will confirm.

Traditional archival methods sometimes fail in the dumbest way. When trying to build an offsite backup of a local newspaper (pamplemousse.ca), I found that WordPress adds query strings (e.g. ?ver=1.12.4) at the end of JavaScript includes. This confuses content-type detection in the web servers that serve the archive, which rely on the file extension to send the right Content-Type header. When such an archive is loaded in a web browser, it fails to load scripts, which breaks dynamic websites.
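One crude workaround, which I have not tested extensively, is to strip those query strings from the saved file names so that extension-based MIME type detection works again; the links in the converted HTML may need a matching fix. Here archive/ is just a placeholder for the directory holding the mirror:

$ find archive/ -name '*\?*' | while read -r f; do mv -n "$f" "${f%%\?*}"; done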

As the web moves toward using the browser as a virtual machine to run arbitrary code, archival methods relying on pure HTML parsing need to adapt. The solution for such problems is to record (and replay) the HTTP headers delivered by the server during the crawl and indeed professional archivists use just such an approach.

Creating and displaying WARC files

At the Internet Archive, Brewster Kahle and Mike Burner designed the ARC (for "ARChive") file format in 1996 to provide a way to aggregate the millions of small files produced by their archival efforts. The format was eventually standardized as the WARC ("Web ARChive") specification that was released as an ISO standard in 2009 and revised in 2017. The standardization effort was led by the International Internet Preservation Consortium (IIPC), which is an "international organization of libraries and other organizations established to coordinate efforts to preserve internet content for the future", according to Wikipedia; it includes members such as the US Library of Congress and the Internet Archive. The latter uses the WARC format internally in its Java-based Heritrix crawler.

A WARC file aggregates multiple resources like HTTP headers, file contents, and other metadata in a single compressed archive. Conveniently, Wget actually supports the file format with the --warc parameter. Unfortunately, web browsers cannot render WARC files directly, so a viewer or some conversion is necessary to access the archive. The simplest such viewer I have found is pywb, a Python package that runs a simple webserver to offer a Wayback-Machine-like interface to browse the contents of WARC files. The following set of commands will render a WARC file on http://localhost:8080/:

$ pip install pywb
$ wb-manager init example
$ wb-manager add example crawl.warc.gz
$ wayback

This tool was, incidentally, built by the folks behind the Webrecorder service, which can use a web browser to save dynamic page contents.

Unfortunately, pywb has trouble loading WARC files generated by Wget, because Wget followed an inconsistency in the 1.0 specification that was fixed in the 1.1 specification. Until Wget or pywb fix those problems, WARC files produced by Wget are not reliable enough for my uses, so I have looked at other alternatives. A crawler that got my attention is simply called crawl. Here is how it is invoked:

$ crawl https://example.com/

(It does say "very simple" in the README.) The program does support some command-line options, but most of its defaults are sane: it will fetch page requirements from other domains (unless the -exclude-related flag is used), but does not recurse out of the domain. By default, it fires up ten parallel connections to the remote site, a setting that can be changed with the -c flag. But, best of all, the resulting WARC files load perfectly in pywb.
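As an aside, to be gentler on the remote site, that number of parallel connections can be lowered, for example:

$ crawl -c 2 https://example.com/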

Future work and alternatives

There are plenty more resources for using WARC files. In particular, there's a Wget drop-in replacement called Wpull that is specifically designed for archiving web sites. It has experimental support for PhantomJS and youtube-dl integration that should allow downloading more complex JavaScript sites and streaming multimedia, respectively. The software is the basis for an elaborate archival tool called ArchiveBot, which is used by the "loose collective of rogue archivists, programmers, writers and loudmouths" at ArchiveTeam in its struggle to "save the history before it's lost forever". It seems that PhantomJS integration does not work as well as the team wants, so ArchiveTeam also uses a rag-tag bunch of other tools to mirror more complex sites. For example, snscrape will crawl a social media profile to generate a list of pages to send into ArchiveBot. Another tool the team employs is crocoite, which uses the Chrome browser in headless mode to archive JavaScript-heavy sites.

This article would also not be complete without a nod to the HTTrack project, the "website copier". Working similarly to Wget, HTTrack creates local copies of remote web sites but unfortunately does not support WARC output. Its interactive aspects might be of more interest to novice users unfamiliar with the command line.

In the same vein, during my research I found a full rewrite of Wget called Wget2 that has support for multi-threaded operation, which might make it faster than its predecessor. It is missing some features from Wget, however, most notably reject patterns, WARC output, and FTP support, but adds RSS, DNS caching, and improved TLS support.

Finally, my personal dream for these kinds of tools would be to have them integrated with my existing bookmark system. I currently keep interesting links in Wallabag, a self-hosted "read it later" service designed as a free-software alternative to Pocket (now owned by Mozilla). But Wallabag, by design, creates only a "readable" version of the article instead of a full copy. In some cases, the "readable version" is actually unreadable and Wallabag sometimes fails to parse the article. Instead, other tools like bookmark-archiver or reminiscence save a screenshot of the page along with full HTML but, unfortunately, no WARC file that would allow an even more faithful replay.

The sad truth of my experiences with mirrors and archival is that data dies. Fortunately, amateur archivists have tools at their disposal to keep interesting content alive online. For those who do not want to go through that trouble, the Internet Archive seems to be here to stay and Archive Team is obviously working on a backup of the Internet Archive itself.

This article first appeared in the Linux Weekly News.

As usual, here's the list of issues and patches generated while researching this article:

I also want to personally thank the folks in the #archivebot channel for their assistance and letting me play with their toys.

The Pamplemousse crawl is now available on the Internet Archive; it might end up in the Wayback Machine at some point if the Archive curators think it is worth it.

Another example of a crawl is this archive of two Bloomberg articles which the "save page now" feature of the Internet Archive wasn't able to save correctly. But webrecorder.io could! Those pages can be seen in the Webrecorder player to get a better feel of how faithful a WARC file really is.

Finally, this article was originally written as a set of notes and documentation in the archive page which may also be of interest to my readers.


August 2018 report: LTS, Debian, Upgrades

Fri, 08/31/2018 - 19:19
Debian Long Term Support (LTS)

This is my monthly Debian LTS report.

twitter-bootstrap

I researched some of the security issues of the Twitter Bootstrap framework, which is clearly showing its age in Debian. Of the three vulnerabilities, I couldn't reproduce two (CVE-2018-14041 and CVE-2018-14042), so I marked them as "not affecting" jessie. I also found that CVE-2018-14040 was relevant only for Bootstrap 3 (because yes, we still have Bootstrap 2, in all suites, which will hopefully be fixed in buster).

The patch for the latter was a little tricky to figure out, but ended up being simple. I tested the patch with a private copy of the code, which works here, and published the result as DLA-1479-1.

What's concerning with this set of vulnerabilities is that they show a broader problem than the one identified in those specific instances. Brian May found at least one other similar issue, although I wasn't able to exploit it in a quick attempt. Besides, I'm not sure we want to audit the entire Bootstrap codebase: upstream fixed this issue more widely in the v4 series, and Debian should follow suit, at least in future releases, and remove older releases from the archive.

tiff

A classic. I tried and failed to reproduce CVE-2018-15209 in the tiff package. I'm a bit worried by Brian May's results, which show the proof of concept eating up all memory in his tests. Since I could not reproduce the issue, I marked the package as N/A in jessie and moved on.

Ruby 2.1

Another classic source of vulnerabilities... The patches were easy to backport, tests passed, so I just uploaded and published DLA-1480-1.

GDM 3

I reviewed Markus Koschany's work on CVE-2018-14424. His patches seemed to work in my tests as I couldn't see any segfault in jessie, either in the kernel messages or through a debugger.

True, the screen still "flashes", so one might think there is still a crash, but this is actually expected behavior. Indeed, this is the first D-Bus command being run:

dbus-send --system --dest=org.gnome.DisplayManager \
    --type=method_call --print-reply=literal \
    /org/gnome/DisplayManager/LocalDisplayFactory \
    org.gnome.DisplayManager.LocalDisplayFactory.CreateTransientDisplay

Or, in short, CreateTransientDisplay, which is also known as fast user switching, brings you back to the login screen. If you enter the same username and password, you get your session back. So no crash. After talking with Koschany, we'll wait a little longer for feedback from the reporter but otherwise I expect to publish the fixed package shortly.

git-annex

This is a bigger one I took from Koschany. The patch was large, and in a rather uncommon language (Haskell).

The first patch was tricky, as function names had changed and some functionality (the P2P layer, the setkey command, and content verification) was completely missing. On advice from upstream, the content verification functionality was backported, as it was critical for the second tricky patch, which required more Haskell gymnastics.

This time again, Haskell was nice to work with: when you change types and APIs, the compiler makes sure that everything works out and that there are no inconsistencies. This logic is somewhat backwards from what we are used to: normally, in security updates, we avoid breaking APIs at all costs. But in Haskell, breaking them is a fundamental way to make sure the system is still coherent.

More details, including embarrassing fixes to the version numbering scheme, are best explained in the email thread. An update for this will come out shortly, after giving more time for upstream to review the final patchset.

Fighting phishing

After mistyping the address of the security tracker, I ended up on this weird page:

Some phishing site masquerading as a Teksavvy customer survey.

Confused and alarmed, I thought I was being intercepted by my ISP, but after looking on their forums, I found out they actually get phished like this all the time. As it turns out, the domain name debain.org (notice the typo) is actually registered to some scammers. So I implemented a series of browser quick searches as a security measure and shared those with the community. Only after feedback from a friend did I realize that surfraw (SR) has been doing this all along. The problem with SR is that it's mostly implemented with messy shell scripts, and those cannot easily be translated back into browser shortcuts, which are still useful on their own. That, and the SR plugins (individually called an "elvis", "elvi" in the plural) are horribly outdated.
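For example, one of those quick searches is nothing more than a keyword bookmark (say, with the keyword debsec, a name I just made up here) pointing at the following URL, so that typing debsec CVE-2018-14040 in the location bar goes straight to the security tracker instead of through a typo-prone address:

https://security-tracker.debian.org/tracker/%s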

Ideally, trivial elvi would simply be "bookmarks" (which are really just one link per line) that can then easily be translated back into browser bookmarks. But that would require converting a bunch of those plugins, something I don't currently have the energy (or time) for. All this reminds me a lot of the interwiki links from the wiki world and looks like an awful duplication of information. Even in this wiki I have similar shortcuts, which are yet another database of such redirections. Surely there is a better way than maintaining all of this separately?

Who claimed all the packages?

After struggling again to find some (easy, I admit) work, I wrote a patch to show per-user package claims. Now, if --verbose is specified, the review-update-needed script will also show a list of users who claimed packages and how many each has claimed. This can help us figure out who's overloaded and might need some help.

Post-build notifications in sbuild

I sent a patch to sbuild to make sure we can hook into failed builds on completion as well as successful ones. Upstream argued this is best accomplished with a wrapper, but I believe that is insufficient, as a wrapper will not have knowledge of the sbuild internals and won't be able to send notifications effectively. That is, after all, why there is a post-build hook right now, which unfortunately runs only on successful builds.

GnuTLS and other reviews

I reviewed questions from Ola Lundqvist regarding the pending GnuTLS security vulnerabilities designated CVE-2018-10844, CVE-2018-10845 and CVE-2018-10846. Those came from a paper called Pseudo Constant Time Implementations of TLS Are Only Pseudo Secure. I am still unsure of the results: after reviewing the paper in detail, I am worried that the upstream fixes are incomplete. Hopefully Lundqvist will figure it out, but in any case I am available to review this work again next week.

I also provided advice on a squirrelmail bugfix backport suggestion.

Other free software work

I was actually on vacation this month so this is a surprising amount of activity for what was basically a week of work.

Buster upgrade

I upgraded my main workstation to buster, in order to install various Node.js programs through npm for that Dat article (which will be public here shortly). It's part of my routine: when enough backports pile up or I need too much stuff from unstable, it's time to make the switch. This makes development on Debian easier and helps with testing the next version of stable before it is released. I do this only on my busiest machine, where I can fix things quickly when they break: my laptop and server remain on stable so I don't have to worry about them too much.

It was a bumpy ride: font rendering changed because of the new rendering engine in FreeType. Someone ended up finding a workaround in Debian bug #866685 which allowed me to keep the older rendering engine but I am worried it might be removed in the future. Hopefully that bug will trickle upstream and Debian users won't see a regression when they upgrade to buster.
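If I remember correctly, that workaround boils down to forcing the old TrueType interpreter through an environment variable, something like this in /etc/environment or a shell profile:

FREETYPE_PROPERTIES="truetype:interpreter-version=35"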

A major issue was a tiny bug in the python-sh library which caused my entire LWN workflow to collapse. Thankfully, it turned out upstream had already released a fix and all I had to do was to update the package and NMU the result. As it turns out, I was already part of the Python team, and that should have been marked as a team upload, but I didn't know. Strange how memory works sometimes.

Other problems were similar: dictd, for example, failed to upgrade (Debian bug #906420, fixed). There are about 15 different packages that are missing from stretch: many have FTBFS problems, others have real critical bugs. Others are simply missing from the archive: I particularly pushed on wireguard (Debian bug #849308), taffybar (Debian bug #895264), and hub (Debian bug #807866).

I won't duplicate the whole upgrade documentation here; the details are in buster.

Debian 25th anniversary updates

The Debian project turned 25 this year so it was a good occasion to look back at history and present. I became a Debian Developer in 2010, a Debian maintainer in 2009, and my first contributions to the project go all the way back to 2003, when I started filing bugs. So this is anywhere between my 8th and 15th birthday in the project.

I didn't celebrate this in any special way, although I did make sure to keep my packages up to date when I returned from vacation. That meant a few uploads:

Work on smokeping and charybdis happened as part of our now regular Debian & Stuff, along with LeLutin, who is helping out and learning a few packaging tricks along the way.

Other software upgrades

During the above buster upgrade, Prometheus broke because the node exporter metrics labels changed. More accurately, what happened is that Grafana would fail to display some of the nodes. As it turns out, all that was needed was to update a few Grafana dashboards (as those don't update automatically, of course). But it brought to my attention that a bunch of packages I had installed were not being upgraded as automatically as the rest of my Debian infrastructure. There were a few reasons for that:

  1. packages were installed from a third-party repository

  2. packages were installed from unstable

  3. there were no packages: software was installed in a container

  4. there were no packages: software was installed by hand

I'm not sure which is worse between 3 and 4. As it turns out, containers were harder to deal with because they also involved upgrading docker.io, which was more difficult.

For each forgotten program, I tried to make sure it wouldn't stay stale any longer: in the case of 1 or 2, a proper apt preference (or "pin") was added to automate upgrades. For 3 and 4, I added the release feeds of the program to feed2exec so I get an email when upstream makes a new release.

Those are the programs I had to deal with:

  • rainloop: the upgrade guide is trivial. Make a backup:

    (umask 0077; tar cfz rainloop.tgz /var/lib/rainloop)

    Then decompress the new archive on top. It keeps old data in rainloop/v/1.11.0.203/, which should probably be removed on the next upgrade. The upgrade presumably runs when visiting the site, which worked flawlessly afterwards.

  • grafana: the upgrade guide says to back up the database in /var/lib/grafana/grafana.db, but I backed up the whole thing:

    (umask 0077; tar cfz grafana /var/lib/grafana)

    The upgrade from 4.x to 5.2.x was trivial and automated. There is, unfortunately, still no official package. A visit to the Grafana instance shows some style changes and improvements and that things generally just work.

  • the toot Mastodon client has entered Debian, so I was able to remove yet another third-party repository. This involved adding a pin to follow the buster sources for this package:

    Package: toot
    Pin: release n=buster
    Pin-Priority: 500
  • Docker is in this miserable state in stretch. There is a really old binary in jessie-backports (1.6) and, for some reason, I had a random version from unstable running (1.13.1~ds1-2). I upgraded to the sid version, which installs fine in stretch because Go binaries are statically compiled. But the containers did not restart automatically. Starting them by hand gave this error:

    root@marcos:/etc/apt/sources.list.d# ~anarcat/bin/subsonic-start
    e4216f435be477dacd129ed8c2b23b2b317e9ef9a61906f3ba0e33265c97608e
    docker: Error response from daemon: OCI runtime create failed: json: cannot unmarshal object into Go value of type []string: unknown.

    Strangely, the container was started, but it was not reachable over the network. The problem was that runc needed to be upgraded as well, so that was promptly fixed.

    The magic pin to follow buster is like this:

    Package: docker.io runc
    Pin: release n=buster
    Pin-Priority: 500
  • airsonic upgrades are a little trickier because I run it inside a Docker container. The first step is to fix the Dockerfile and rebuild the container image:

    sed -i s/10.1.1/10.1.2/ Dockerfile
    sudo docker build -t anarcat/airsonic .

    Then the image is ready to go. The previous container needs to be stopped and the new one started:

    docker ps
    docker stop 78385cb29cd5
    ~anarcat/bin/subsonic-start

    The latter is a script I wrote because I couldn't remember the magic startup sequence, which is silly: you'd think the Dockerfile would know stuff like that. A visit to the radio site showed that everything seemed to be in order but no deeper test was performed.

All of this assumes that updates to unstable will not be too disruptive or that, if they are, the NEWS.Debian file will warn me so I can take action. That is probably a little naive of me, but it beats having outdated infrastructure running exposed on the network.

Other work

Then there's the usual:
