[ninux-dev] Graphite:

nemesis nemesis at ninux.org
Sat Mar 29 15:10:00 CET 2014


 Hi Nino and everyone else,

 the other day I was telling you about Graphite (python/django), which, 
 after a good deal of research and study, I believe is the most suitable 
 tool for storing and visualizing network and application metrics.

 As for collecting the data, I am leaning towards 
 statsd (javascript/nodejs): https://github.com/etsy/statsd/
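
 To give an idea of what collecting data with statsd looks like from 
 the application side, here is a minimal sketch in Python. It assumes a 
 statsd daemon listening on its usual UDP port 8125 on localhost; the 
 metric names are made up for illustration and the helper functions are 
 hypothetical, not part of any statsd client library.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one statsd datagram as <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; statsd does not reply."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


# Hypothetical metric names, for illustration only.
counter = statsd_line("ninux.nodes.added", 1, "c")         # counter
timer = statsd_line("ninux.api.response_time", 320, "ms")  # timer, in ms
gauge = statsd_line("ninux.links.active", 42, "g")         # gauge

# To actually send one of them (needs a running statsd):
# send_statsd(counter)
```

 Since the transport is UDP, the application never blocks waiting for 
 the metrics daemon, which is a large part of why statsd is cheap to add 
 to existing code.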

 Here is some essential info:

 First of all, who uses Graphite?

 Orbitz
 Sears Holdings
 Etsy (see http://codeascraft.etsy.com/2010/12/08/track-every-release/)
 Google (opensource Rocksteady project)
 Media Temple
 Canonical
 Brightcove (see http://opensource.brightcove.com/project/Diamond/)
 Vimeo
 SocialTwist
 Douban

 https://graphite.readthedocs.org/en/latest/who-is-using.html

 How did statsd come about?
 http://codeascraft.com/2011/02/15/measure-anything-measure-everything/

 Who uses statsd?
 I haven't found a complete list, but apparently most of those who use 
 Graphite pair it with statsd.

 And apparently Instagram is among them: 
 http://instagram-engineering.tumblr.com/post/20541814340/keeping-instagram-up-with-over-a-million-new-users-in

 Some scattered FAQ entries covering what you asked me the other evening:

 What is Graphite?

 Graphite is a highly scalable real-time graphing system. As a user, you 
 write an application that collects numeric time-series data that you are 
 interested in graphing, and send it to Graphite's processing backend, 
 carbon, which stores the data in Graphite's specialized database. The 
 data can then be visualized through graphite's web interfaces.
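
 As a rough illustration of the "send it to carbon" step described 
 above, the sketch below uses carbon's plaintext protocol: one line per 
 data point, in the form "metric.path value timestamp". It assumes a 
 carbon-cache listening on the default plaintext port 2003 on localhost; 
 the metric name and the helper functions are hypothetical.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Build one plaintext-protocol line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_carbon(lines, host="127.0.0.1", port=2003):
    """Send a batch of lines over a single TCP connection."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Hypothetical metric name and an arbitrary example timestamp.
line = carbon_line("ninux.roma.node42.latency_ms", 12.5, 1396102200)

# To actually send it (needs a running carbon-cache):
# send_to_carbon([line])
```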

 How scalable is Graphite?

 From a CPU perspective, Graphite scales horizontally on both the 
 frontend and the backend, meaning you can simply add more machines to 
 the mix to get more throughput. It is also fault tolerant in the sense 
 that losing a backend machine will cause a minimal amount of data loss 
 (whatever that machine had cached in memory) and will not disrupt the 
 system if you have sufficient capacity remaining to handle the load.

 From an I/O perspective, under load Graphite performs lots of tiny I/O 
 operations on lots of different files very rapidly. This is because each 
 distinct metric sent to Graphite is stored in its own database file, 
 similar to how many tools (drraw, Cacti, Centreon, etc) built on top of 
 RRD work. In fact, Graphite originally did use RRD for storage until 
 fundamental limitations arose that required a new storage engine.
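
 To make the "one file per metric" point concrete: the dots in a metric 
 name become directory levels on disk, with a .wsp file at the end. The 
 sketch below only illustrates that mapping; the storage root is an 
 assumption (the usual location in a default install) and the function 
 is hypothetical, not Graphite's own code.

```python
import posixpath


def metric_to_path(metric, root="/opt/graphite/storage/whisper"):
    """Map a dotted metric name to the whisper file that would hold it."""
    return posixpath.join(root, *metric.split(".")) + ".wsp"


# Hypothetical metric name, for illustration only.
path = metric_to_path("ninux.roma.node42.latency_ms")
```

 With thousands of metrics this means thousands of small files, each 
 touched on every update, which is where the many tiny I/O operations 
 come from.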

 What is whisper?

 Whisper is a fixed-size database, similar in design to RRD 
 (round-robin-database). It provides fast, reliable storage of numeric 
 data over time.

 Why don't you just use RRD?

 RRD is great, and initially Graphite did use RRD for storage. Over time 
 though, we ran into several issues inherent to RRD's design.

 RRD can't take updates for a timestamp prior to its most recent update. 
 So for example, if you miss an update for some reason you have no simple 
 way of back-filling your RRD file by telling rrdtool to apply an update 
 to the past. Whisper does not have this limitation, and this makes 
 importing historical data into Graphite far easier.

 At the time whisper was written, RRD did not support compacting 
 multiple updates into a single operation. This feature is critical to 
 Graphite's scalability.

 RRD doesn't like irregular updates. If you update an RRD but don't 
 follow up with another update soon, your original update will be lost. 
 This was the straw that broke the camel's back: since Graphite is used 
 for various operational metrics, some of which do not occur regularly 
 (randomly occurring errors, for instance), we started to notice that 
 Graphite sometimes wouldn't display data points which we knew existed 
 because we'd received alarms on them from other tools. The problem 
 turned out to be that RRD was dropping the data points because they were 
 irregular. Whisper had to be written to ensure that all data was 
 reliably stored and accessible.

 Why did you totally rewrite RRD? Couldn't you just submit a patch?

 I didn't totally rewrite it, I rewrote only a small subset of what RRD 
 does, its basic storage mechanism. Patching RRD would mean hundreds of 
 lines of C code, whereas Whisper is under 500 lines of simple python.

 Seriously though, the real reason I didn't simply submit a patch for 
 rrdtool is that whisper's design is incompatible with RRD's feature set. 
 RRD provides the ability to specify an arbitrary update interval, that 
 is you could say that you intend to update your RRD file once every 
 minute, every 10 minutes, whatever. And rrdtool also allows you to 
 configure your RRAs (round-robin archives) independent of this update 
 interval, so you could have a 1-minute precision archive but an update 
 interval of, say, 10 seconds. In this case, RRD will store your updates 
 in a temporary workspace area and after the minute has passed, aggregate 
 them and store them in the archive. Whisper on the other hand mandates 
 that your update interval must be the same as the finest precision 
 archive you configure. So for instance, if your archive configuration is 
 1-minute precision for 2 hours, then 5-minute precision for a day, your 
 update interval *must* be 1-minute. The reason for this is that whisper 
 inserts your updates *immediately* into your finest precision archive, 
 so another update within the same interval would overwrite the previous 
 value. Basically this just means that the onus of aggregating values to 
 fit in the finest precision archive is on the user, not the database.
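
 The overwrite behaviour described above can be sketched with a toy 
 round-robin archive. This is not whisper's actual code or file layout, 
 just a minimal in-memory model under the assumption of a single 
 1-minute archive kept for 2 hours; the class and function names are 
 hypothetical. It shows two updates in the same minute overwriting each 
 other, and the client-side aggregation that avoids losing them.

```python
from collections import defaultdict

STEP = 60    # seconds per slot: 1-minute precision
SLOTS = 120  # 2 hours of 1-minute slots


class TinyArchive:
    """Toy fixed-size archive: one value per STEP-aligned interval."""

    def __init__(self, step=STEP, slots=SLOTS):
        self.step = step
        self.slots = slots
        self.points = [None] * slots  # each entry: (interval_start, value)

    def _locate(self, timestamp):
        start = timestamp - (timestamp % self.step)
        return start, (start // self.step) % self.slots

    def update(self, timestamp, value):
        start, index = self._locate(timestamp)
        # Written immediately; any earlier value for this interval is lost.
        self.points[index] = (start, value)

    def fetch(self, timestamp):
        start, index = self._locate(timestamp)
        point = self.points[index]
        if point is not None and point[0] == start:
            return point[1]
        return None


def aggregate(samples, step=STEP):
    """Average raw (timestamp, value) samples per interval before storing."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - (ts % step)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


archive = TinyArchive()
archive.update(600, 10.0)
archive.update(610, 30.0)         # same minute: replaces 10.0
overwritten = archive.fetch(600)  # only the last value survives

# Aggregating on the client first keeps the information from every sample.
samples = [(600, 10.0), (610, 30.0), (620, 20.0), (660, 5.0)]
for start, mean in aggregate(samples).items():
    archive.update(start, mean)
```

 This per-interval aggregation is the kind of work a collector such as 
 statsd does before flushing to Graphite, so the application itself can 
 report events as often as it likes.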

 How fast is whisper?

 Whisper is fast enough. It is slower than rrdtool because whisper is 
 written in python and rrdtool is written in C, go figure. However the 
 differences in speed are quite small. I spent a lot of time optimizing 
 whisper to get as close to rrdtool's performance as I could. Currently 
 update operations take anywhere from 2 to 3 times as long as rrdtool, 
 and fetch operations take anywhere from 2 to 5 times as long. This 
 sounds a lot worse than it is (especially considering it was originally 
 20x slower for each operation) because in practice the actual difference 
 is measured in hundreds of microseconds (10^-4), so less than a 
 millisecond difference for simple cases.

 How does whisper work?

 Pretty simplistically. See for yourself, visit 
 http://bazaar.launchpad.net/~graphite-dev/graphite/main/files and click 
 lib, graphite, then whisper.py.

 --------------

 Further reading:
 https://graphite.readthedocs.org/en/latest/overview.html
 http://graphite.wikidot.com/faq
 http://graphite.wikidot.com/whisper

 --------------

 I will stand on the shoulders of giants. I am certainly not going to 
 rewrite something that people who really know their stuff have already 
 built, and apparently built well, since it is used by thousands of 
 people who contribute to improving it.

 Any reference to existing things or people is purely coincidental... 
 like https://github.com/wlanslovenija/datastream, which is also totally 
 coincidental, just to show you that it is not true that every community 
 network reinvents the wheel, noooooo

 Federico

