
Feed aggregator

Shell Scripting a Bunco Game

Linux Journal - Sun, 01/21/2018 - 10:50

Bunco—a dice game that makes Yahtzee look complicated!

Categories: Linux News

Building Your Own Audible

Linux Journal - Sat, 01/20/2018 - 09:35

A quick look at some options for streaming audio books.

Categories: Linux News

Using Python for Science

Linux Journal - Fri, 01/19/2018 - 11:59

Introducing Anaconda, a Python distribution for scientific research.

I've looked at several ways you could use Python to do scientific calculations in the past, but I've never actually covered how to set up and use Python itself in a way that makes scientific work easier. Anaconda does just that.

Categories: Linux News

Wine, Mozilla, GNOME and DragonFly BSD

Linux Journal - Fri, 01/19/2018 - 08:38

News briefs for January 19, 2018.

Wine 3.0, an annual release, became available yesterday. According to the release notes, highlights include Direct3D 10 and 11 support, the Direct3D command stream, the Android graphics driver and improved DirectWrite and Direct2D support.

Categories: Linux News

Getting Started with ncurses

Linux Journal - Thu, 01/18/2018 - 16:22

How to use curses to draw to the terminal screen.

While graphical user interfaces are very cool, not every program needs to run with a point-and-click interface. For example, the venerable vi editor ran in plain-text terminals long before the first GUI.

Categories: Linux News

$25k Linux Journalism Fund

Linux Journal - Thu, 01/18/2018 - 12:52

Linux Journal's new parent, Private Internet Access, has established a $25k fund to jump-start the next generation of Linux journalism—and to spend it here, where Linux journalism started in 1994.

Categories: Linux News

Purism Progress Report, Spectre Mitigation for Ubuntu, Malicious Chrome Extensions and More

Linux Journal - Thu, 01/18/2018 - 11:20

News briefs for January 18, 2018.

Purism, the group behind the security- and privacy-focused Librem 5 phone, recently published a progress report highlighting the latest developments and design decisions in its crowdfunded project. Changes include an even faster processor.

Categories: Linux News

New Kernel Releases, Net Neutrality, Thunderbird Survey and More

Linux Journal - Wed, 01/17/2018 - 13:54

News roundup for January 17, 2018.

Categories: Linux News

Avoiding Server Disaster

Linux Journal - Wed, 01/17/2018 - 09:46

Worried that your server will go down? You should be. Here are some disaster-planning tips for server owners.

Categories: Linux News

Montréal-Python 69 - Call For Proposals

Montreal Python - Wed, 01/17/2018 - 00:00

First of all, the Montreal-Python team would like to wish you a Happy New Year!

With every new year come resolutions. If presenting at a tech event is on your list, here's your chance to cross it off early: Montreal Python is opening the call for presentations for our 2018 events. (Feel free to submit a proposal whether you have resolutions or not ;))

Send us your proposal at team@montrealpython.org. We have spots for lightning talks (5 min) and regular talks (15 to 30 min).

When

February 5th, 2018 at 6:00PM

Where

TBD

Categories: External Blogs

Monitoring with Prometheus 2.0

Anarcat - Tue, 01/16/2018 - 19:00

This is one part of my coverage of KubeCon Austin 2017.

Prometheus is a monitoring tool built from scratch by SoundCloud in 2012. It works by pulling metrics from monitored services and storing them in a time series database (TSDB). It has a powerful query language to inspect that database, create alerts, and plot basic graphs. Those graphs can then be used to detect anomalies or trends for (possibly automated) resource provisioning. Prometheus also has extensive service discovery features and supports high availability configurations. That's what the brochure says, anyway; let's see how it works in the hands of an old grumpy system administrator. I'll be drawing comparisons with Munin and Nagios frequently because those are the tools I have used for over a decade in monitoring Unix clusters.

Monitoring with Prometheus and Grafana

What distinguishes Prometheus from other solutions is the relative simplicity of its design: for one, metrics are exposed over HTTP using a special URL (/metrics) and a simple text format. Here is, as an example, some network metrics for a test machine:

$ curl -s http://curie:9100/metrics | grep node_network_.*_bytes
# HELP node_network_receive_bytes Network device statistic receive_bytes.
# TYPE node_network_receive_bytes gauge
node_network_receive_bytes{device="eth0"} 2.720630123e+09
# HELP node_network_transmit_bytes Network device statistic transmit_bytes.
# TYPE node_network_transmit_bytes gauge
node_network_transmit_bytes{device="eth0"} 4.03286677e+08

In the above example, the metrics are named node_network_receive_bytes and node_network_transmit_bytes. They have a single label/value pair (device="eth0") attached to them, along with the value of the metrics themselves. This is just a sample of the couple hundred metrics (CPU, memory, disk usage, temperature, and so on) exposed by the "node exporter", a basic stats collector running on monitored hosts. Metrics can be counters (e.g. per-interface packet counts), gauges (e.g. temperature or fan sensors), or histograms. The latter allow, for example, 95th-percentile analysis, something that has been missing from Munin forever and is essential to billing networking customers. Another popular use for histograms is maintaining an Apdex score, to make sure that N requests are answered in X time. The various metric types are carefully analyzed before being stored to correctly handle conditions like overflows (which occur surprisingly often on gigabit network interfaces) or resets (when a device restarts).
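To make the exposition format concrete, here is a minimal sketch of a parser for sample lines like the ones above. This is not the official client-library parser: real implementations also handle label-value escaping, timestamps and multi-line help text, which this toy version ignores.

```python
import re

# Matches one sample line: metric name, optional {label="value",...}, value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'  # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'           # optional label block
    r'\s+(?P<value>\S+)$'                   # sample value
)

def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):  # skip HELP/TYPE comments
            continue
        m = SAMPLE_RE.match(line)
        if not m:
            continue
        labels = {}
        if m.group('labels'):
            for pair in m.group('labels').split(','):
                key, _, val = pair.partition('=')
                labels[key.strip()] = val.strip().strip('"')
        samples.append((m.group('name'), labels, float(m.group('value'))))
    return samples

exposition = '''\
# HELP node_network_receive_bytes Network device statistic receive_bytes.
# TYPE node_network_receive_bytes gauge
node_network_receive_bytes{device="eth0"} 2.720630123e+09
node_network_transmit_bytes{device="eth0"} 4.03286677e+08
'''

for name, labels, value in parse_metrics(exposition):
    print(name, labels, value)
```

The simplicity of the format is the point: a few lines of regular-expression code are enough to get usable data out of any exporter.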

Those metrics are fetched from "targets", which are simply HTTP endpoints, added to the Prometheus configuration file. Targets can also be added automatically through various discovery mechanisms, like DNS, which allow having a single A or SRV record that lists all the hosts to monitor; or Kubernetes or cloud-provider APIs that list all containers or virtual machines to monitor. Discovery works in real time, so it will correctly pick up changes in DNS, for example. It can also add metadata (e.g. IP address found or server state), which is useful for dynamic environments such as Kubernetes or container orchestration in general.

Once collected, metrics can be queried through the web interface, using a custom language called PromQL. For example, a query showing the average bandwidth over the last minute for interface eth0 would look like:

rate(node_network_receive_bytes{device="eth0"}[1m])

Notice the "device" label, which we use to restrict the search to a single interface. This query can also be plotted into a simple graph on the web interface.
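What rate() does over a counter can be re-implemented in a few lines. The sketch below divides the increase between samples by the elapsed time; the real PromQL function is more careful (it extrapolates to the edges of the range window), but the counter-reset handling is the same idea.

```python
def counter_rate(samples):
    """Approximate PromQL rate() over (timestamp, value) counter samples.

    Counter resets (a device rebooting, a counter wrapping) show up as the
    value dropping; in that case only the post-reset accumulation counts.
    """
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 < v0:          # counter reset
            increase += v1   # count only what accumulated since the reset
        else:
            increase += v1 - v0
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Two samples 15 seconds apart, 15 MB received in between: 1 MB/s.
samples = [(0, 2.720630123e+09), (15, 2.720630123e+09 + 15e6)]
print(counter_rate(samples))  # → 1000000.0
```

This is why raw counters, not pre-computed rates, are stored: the rate can be derived at query time over any window.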

What is interesting here is not really the node exporter metrics themselves, as those are fairly standard in any monitoring solution. But in Prometheus, any (web) application can easily expose its own internal metrics to the monitoring server through regular HTTP, whereas other systems would require special plugins, on both the monitoring server and the application side. Note that Munin follows a similar pattern, but uses its own text protocol on top of TCP, which means it is harder to implement for web apps and diagnose with a web browser.
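Because the protocol is plain HTTP and plain text, an application can expose its own metrics with nothing but the standard library. The following sketch uses made-up metric names for a hypothetical application; the content type is the one Prometheus expects for the text format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter for the application we want to monitor.
REQUEST_COUNT = 0

def render_metrics():
    """Render application counters in the Prometheus text format."""
    lines = [
        '# HELP app_requests_total Requests served since start.',
        '# TYPE app_requests_total counter',
        'app_requests_total %d' % REQUEST_COUNT,
    ]
    return '\n'.join(lines) + '\n'

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUEST_COUNT
        if self.path == '/metrics':
            body = render_metrics().encode('utf-8')
            self.send_response(200)
            self.send_header('Content-Type', 'text/plain; version=0.0.4')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            REQUEST_COUNT += 1   # every other hit is "application traffic"
            self.send_response(200)
            self.end_headers()

# To serve on port 8000:
#   HTTPServer(('', 8000), MetricsHandler).serve_forever()
```

Point Prometheus at the endpoint and you are done; no agent or plugin is needed on either side.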

However, coming from the world of Munin, where all sorts of graphics just magically appear out of the box, this first experience can be a bit of a disappointment: everything is built by hand and ephemeral. While there are ways to add custom graphs to the Prometheus web interface using Go-based console templates, most Prometheus deployments generally use Grafana to render the results using custom-built dashboards. This gives much better results, and allows graphing multiple machines separately, using the Node Exporter Server Metrics dashboard.

All this work took roughly an hour of configuration, which is pretty good for a first try. Things get tougher when extending those basic metrics: because of the system's modularity, it is difficult to add new metrics to existing dashboards. For example, web or mail servers are not monitored by the node exporter. So monitoring a web server involves installing an Apache-specific exporter that needs to be added to the Prometheus configuration. But it won't show up automatically in the above dashboard, because that's a "node exporter" dashboard, not an Apache dashboard. So you need a separate dashboard for that. This is all work that's done automatically in Munin without any hand-holding.

Even then, Apache is a relatively easy one; monitoring some arbitrary server not supported by a custom exporter will require installing a program like mtail, which parses the server's logfiles to expose some metrics to Prometheus. There doesn't seem to be a way to write quick "run this command to count files" plugins that would allow administrators to write quick hacks. The options available are writing a new exporter using client libraries, which seems to be a rather large undertaking for non-programmers. You can also use the node exporter textfile option, which reads arbitrary metrics from plain text files in a directory. It's not as direct as running a shell command, but may be good enough for some use cases. Besides, there are a large number of exporters already available, including ones that can tap into existing Nagios and Munin servers to allow for a smooth transition.
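The textfile option mentioned above can stand in for those quick "count files" hacks. Here is a minimal sketch, assuming the node exporter runs with --collector.textfile.directory pointing at textfile_dir; the metric name and directory paths are made up for the example. The file is renamed into place so the exporter never reads a half-written file.

```python
import glob
import os
import tempfile

def write_file_count_metric(watched_dir, textfile_dir):
    """Write a file-count gauge for the node exporter textfile collector."""
    count = len(glob.glob(os.path.join(watched_dir, '*')))
    # Write to a temporary file first, then rename it into place atomically.
    fd, tmp = tempfile.mkstemp(dir=textfile_dir, suffix='.tmp')
    with os.fdopen(fd, 'w') as f:
        f.write('# TYPE my_watched_files gauge\n')
        f.write('my_watched_files{dir="%s"} %d\n' % (watched_dir, count))
    os.rename(tmp, os.path.join(textfile_dir, 'my_watched_files.prom'))
    return count
```

Run the script from cron and the node exporter picks up the metric on its next scrape, no new exporter required.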

Unfortunately, those exporters will only give you metrics, not graphs. To graph metrics from a third-party Postfix exporter, a graph must be created by hand in Grafana, with a magic PromQL formula. This may involve too much clicking around in a web browser for grumpy old administrators. There are tools like Grafanalib to programmatically create dashboards, but those also involve a lot of boilerplate. When building a custom application, however, creating graphs may actually be a fun and distracting task that some may enjoy. The Grafana/Prometheus design is certainly enticing and enables powerful abstractions that are not readily available with other monitoring systems.

Alerting and high availability

So far, we've worked with only a single server, and done only graphing. But Prometheus also supports sending alarms when things go bad. After working over a decade as a system administrator, I have mixed feelings about "paging" or "alerting", as it's called in Prometheus. Regardless of how well the system is tweaked, I have come to believe it is basically impossible to design a system that will respect workers and not torture on-call personnel through sleep deprivation. It seems it's a feature people want regardless, especially in the enterprise, so let's look at how it works here.

In Prometheus, you design alerting rules using PromQL. For example, to warn operators when a network interface is close to saturation, we could set the following rule:

alert: HighBandwidthUsage
expr: rate(node_network_transmit_bytes{device="eth0"}[1m]) > 0.95*1e+09
for: 5m
labels:
  severity: critical
annotations:
  description: 'Unusually high bandwidth on interface {{ $labels.device }}'
  summary: 'High bandwidth on {{ $labels.instance }}'

Those rules are regularly checked and matching rules are fired to an alertmanager daemon that can receive alerts from multiple Prometheus servers. The alertmanager then deduplicates multiple alerts, regroups them (so a single notification is sent even if multiple alerts are received), and sends the actual notifications through various services like email, PagerDuty, Slack or an arbitrary webhook.
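The grouping step can be illustrated with a toy model. The real alertmanager groups by labels configured in the group_by field of its routing tree, deduplicates alerts arriving from redundant Prometheus servers, and throttles repeat notifications; the sketch below, with made-up alerts, shows only the grouping.

```python
from collections import defaultdict

def group_alerts(alerts, group_by=('alertname',)):
    """Toy model of alertmanager grouping: one notification per group key."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts sharing the same values for the group_by labels are
        # bundled into a single notification.
        key = tuple(alert['labels'].get(l, '') for l in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {'labels': {'alertname': 'HighBandwidthUsage', 'instance': 'curie'}},
    {'labels': {'alertname': 'HighBandwidthUsage', 'instance': 'marcos'}},
    {'labels': {'alertname': 'DiskFull', 'instance': 'curie'}},
]
notifications = group_alerts(alerts)
# Two notifications: one covering both HighBandwidthUsage alerts, one for DiskFull.
```

Grouping is what keeps a single network outage from generating hundreds of pages, one per affected host.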

The Alertmanager has a "gossip protocol" to enable multiple instances to coordinate notifications. This design allows you to run multiple Prometheus servers in a federation model, all simultaneously collecting metrics, and sending alerts to redundant Alertmanager instances to create a highly available monitoring system. Those who have struggled with such setups in Nagios will surely appreciate the simplicity of this design.

The downside is that Prometheus doesn't ship a set of default alerts and exporters do not define default alerting thresholds that could be used to create rules automatically. The Prometheus documentation also lacks examples that the community could use, so alerting is harder to deploy than in classic monitoring systems.

Issues and limitations

Prometheus is already well-established: Cloudflare, Canonical and (of course) SoundCloud are all (still) using it in production. It is a common monitoring tool used in Kubernetes deployments because of its discovery features. Prometheus is, however, not a silver bullet and may not be the best tool for all workloads.

In particular, Prometheus is not designed for long-term storage. By default, it keeps samples for only two weeks, which seems rather small to old system administrators who are used to RRDtool databases that efficiently store samples for years. As a comparison, my test Prometheus instance is taking up as much space for five days of samples as Munin, which has samples for the last year. Of course, Munin only collects metrics every five minutes while Prometheus samples all targets every 15 seconds by default. Even so, this difference in sizes shows that Prometheus's disk requirements are much larger than traditional RRDtool implementations because it lacks native down-sampling facilities. Therefore, retaining samples for more than a year (which is a Munin limitation I was hoping to overcome) will be difficult without some serious hacking to selectively purge samples or adding extra disk space.
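The raw sample-count difference behind that comparison is simple arithmetic. At the default intervals stated above, Prometheus stores twenty times as many samples per series as Munin does, before any compression or down-sampling is considered:

```python
def samples_per_year(scrape_interval_s):
    """Samples stored per time series per year at a given scrape interval."""
    return int(365 * 24 * 3600 / scrape_interval_s)

prometheus = samples_per_year(15)    # Prometheus default: every 15 seconds
munin = samples_per_year(300)        # Munin: every 5 minutes

print(prometheus)          # → 2102400
print(prometheus / munin)  # → 20.0
```

Munin's RRDtool backend then down-samples old data to keep size constant; Prometheus keeps every sample until it expires, which is why retention beyond weeks gets expensive.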

The project documentation recognizes this and suggests using alternatives:

Prometheus's local storage is limited in its scalability and durability. Instead of trying to solve long-term storage in Prometheus itself, Prometheus has a set of interfaces that allow integrating with remote long-term storage systems.

Prometheus in itself delivers good performance: a single instance can support over 100,000 samples per second. When a single server is not enough, servers can federate to cover different parts of the infrastructure. And when that is not enough, sharding is possible. In general, performance depends on avoiding variable data in labels, which keeps the cardinality of the dataset under control, but the dataset size will grow with time regardless. So long-term storage is not Prometheus's strongest suit. But starting with 2.0, Prometheus can finally write to (and read from) external storage engines that can be more efficient than Prometheus. InfluxDB, for example, can be used as a backend and supports time-based down-sampling that makes long-term storage manageable. This deployment, however, is not for the faint of heart.
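Why variable label data matters becomes obvious once you remember that every distinct label combination is its own time series. The label names and counts below are invented for illustration:

```python
def series_count(label_cardinalities):
    """Time series produced by one metric if every combination of label
    values is observed: the product of per-label cardinalities."""
    total = 1
    for n in label_cardinalities:
        total *= n
    return total

# A latency metric labelled by method (4), handler (50), status code (10):
print(series_count([4, 50, 10]))             # 2000 series: manageable

# Add a user-id label with a million distinct values and it explodes:
print(series_count([4, 50, 10, 1_000_000]))  # two billion series
```

This is why identifiers like user IDs, e-mail addresses or request paths should go in logs, not labels.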

Also, security freaks can't help but notice that all this is happening over a clear-text HTTP protocol. Indeed, that is by design: "Prometheus and its components do not provide any server-side authentication, authorisation, or encryption. If you require this, it is recommended to use a reverse proxy." The issue is punted to a layer above, which is fine for the web interface: it is, after all, just a few Prometheus instances that need to be protected. But for monitoring endpoints, this is potentially hundreds of services that are available publicly without any protection. It would be nice to have at least IP-level blocking in the node exporter, although this could also be accomplished through a simple firewall rule.

There is a large empty space for Prometheus dashboards and alert templates. Whereas tools like Munin or Nagios had years to come up with lots of plugins and alerts, and to converge on best practices like "70% disk usage is a warning but 90% is critical", those things all need to be configured manually in Prometheus. Prometheus should aim at shipping standard sets of dashboards and alerts for built-in metrics, but the project currently lacks the time to implement those.

The Grafana list of Prometheus dashboards shows one aspect of the problem: there are many different dashboards, sometimes multiple ones for the same task, and it's unclear which one is the best. There is therefore space for a curated list of dashboards and a definite need for expanding those to feature more extensive coverage.

As a replacement for traditional monitoring tools, Prometheus may not be quite there yet, but it will get there and I would certainly advise administrators to keep an eye on the project. Besides, Munin and Nagios feature-parity is just a requirement from an old grumpy system administrator. For hip young application developers smoking weird stuff in containers, Prometheus is the bomb. Just take for example how GitLab started integrating Prometheus, not only to monitor GitLab.com itself, but also to monitor the continuous-integration and deployment workflow. By integrating monitoring into development workflows, developers are immediately made aware of the performance impacts of proposed changes. Performance regressions can therefore be identified quickly, which is a powerful tool for any application.

Whereas system administrators may want to wait a bit before converting existing monitoring systems to Prometheus, application developers should certainly consider deploying Prometheus to instrument their applications; it will serve them well.

This article first appeared in the Linux Weekly News.

Categories: External Blogs

Achieving Inbox Zero

Linux Journal - Tue, 01/16/2018 - 18:13

See how Google Inbox helps Shawn reach his quest for "inbox zero".

Categories: Linux News

Linux Journal 2.0 Progress Report

Linux Journal - Tue, 01/16/2018 - 14:29

The latest updates and another request for feedback.

Categories: Linux News

Firefox Release, Xen, KDE's Plasma and More

Linux Journal - Tue, 01/16/2018 - 12:04

News highlights for January 16, 2018.

Mark your calendars for January 23, 2018, to download the latest Firefox 58 release, packed with performance and bug fixes, an even better site source-code debugger and more.

Categories: Linux News

Linux kernel mailing list back online; Meltdown and Spectre vulnerabilities; Mobile OS eelo; Barcelona now using Linux

Linux Journal - Mon, 01/15/2018 - 17:15

News digest for January 15, 2018.

Just released on January 14, 2018: the 4.15-rc8 Linux kernel. You can view the commit diff here, and more information is available from The Linux Kernel Archives.

Categories: Linux News

Creating an Internet Radio Station with Icecast and Liquidsoap

Linux Journal - Mon, 01/15/2018 - 09:34

Ever wanted to stream prerecorded music or a live event, such as a lecture or concert, for an internet audience? With Icecast and Liquidsoap, you can set up a full-featured, flexible internet radio station using free software and open standards.

Categories: Linux News

Introducing the CAPS0ff Project

Linux Journal - Sun, 01/14/2018 - 10:12

How you can help retrieve ROM data for classic video games.

Categories: Linux News

Visualizing Molecules with Python

Linux Journal - Sat, 01/13/2018 - 11:30

Introducing PyMOL, a Python package for studying chemical structures.

I've looked at several open-source packages for computational chemistry in the past, but in this article, I cover a package written in Python called PyMOL.

Categories: Linux News

Thinking Concurrently: How Modern Network Applications Handle Multiple Connections

Linux Journal - Fri, 01/12/2018 - 08:16

Reuven explores different types of multiprocessing and looks at the advantages and disadvantages of each.

Categories: Linux News

Sysadmin Tips on Preparing for Vacation

Linux Journal - Thu, 01/11/2018 - 09:35

Read on for ways to help reduce the chance that your vacation will be interrupted by sysadmin issues.

Categories: Linux News
Syndicate content