Deploying Logstash with Puppet
As an extension to my last post, here's an initial recipe for deploying Logstash with Puppet. You can grab my gist for the init script here: https://gist.github.com/2620449
class logstash {
  file { "/opt/logstash-1.1.0-monolithic.jar":
    source => "puppet:///modules/logstash/logstash-1.1.0-monolithic.jar",
    ensure => present
  }

  file { "/etc/init.d/logstash":
    source => "puppet:///modules/logstash/logstash",
    ensure => present,
    mode   => 0755,
    owner  => root,
    group  => root
  }

  file { "/etc/logstash.conf":
    source  => "puppet:///modules/logstash/logstash.conf",
    ensure  => present,
    replace => false
  }

  service { "logstash":
    ensure    => running,
    subscribe => File["/etc/logstash.conf"],
    require   => [
      File["/etc/init.d/logstash"],
      File["/etc/logstash.conf"],
      File["/opt/logstash-1.1.0-monolithic.jar"]
    ]
  }
}
You will probably need a different logging configuration per application. In that case the best thing to do is either move the config definition into each application's own manifest, or modify the file where needed. That is why replace is set to false: Puppet creates the file once and leaves later changes alone.
Centralized logging with Graylog2
Design considerations
Centralized logging isn't an easy task. You need to handle very large amounts of data with a lot of write operations and heavy indexing, which means plenty of CPU and memory usage. That makes a scalable text indexing and storage backend extremely important, as well as a decoupled architecture. Over the course of a month we evaluated various tools and architectures at Praekelt to find one that worked well (never mind the many that just don't work at all). The final configuration uses Ubuntu 12.04 (Precise) for the central server, Graylog2 to receive and analyse logs, RabbitMQ to queue the non-syslog logs that Logstash ships to Graylog2, and Elasticsearch and MongoDB, which Graylog2 uses to store logs and stats.
Logstash
Logstash is a good tool. I'm somewhat annoyed by its version dependencies when writing directly to Elasticsearch, but no matter, we can use Graylog2 to fill the gap.
Syslog
Ubuntu has been nice enough to ship rsyslog for quite a while now. While we do want to use remote syslog to collect the usual system logs and dispatch them to the log server, what we absolutely don't want to do is pipe noisy application logs (like HTTP access logs) into syslog. Keeping them separate has a lot of benefits when you troubleshoot your system later, and it avoids flooding the local message facilities. So while using rsyslog's imfile is tempting and easy, it becomes difficult to manage later.
Graylog2
Where Logstash seems to fail on the centralised side of things, Graylog2 is substantially easier to deploy and works on the latest versions of Elasticsearch, and also employs MongoDB as a key store for aggregating statistics which is a very good idea. Graylog2 has some setup guides (and packages) for Ubuntu Lucid - unfortunately Lucid support just ended, so we'll just configure it from scratch in Precise.
Setup
Install some necessary packages
root@logger:~# aptitude install build-essential rabbitmq-server openjdk-6-jre-headless mongodb rubygems
Grab the deb packages for Elasticsearch from http://www.elasticsearch.org/download/ and both the Graylog2 server and web interface from http://graylog2.org/download.
root@logger:~# dpkg -i elasticsearch-0.19.3.deb
Take note of any errors or missing dependencies and make sure it starts itself up. You don't need to configure anything else for Elasticsearch. Now get RabbitMQ running a bit more securely.
root@logger:~# rabbitmqctl add_user logging mypassword
Creating user "logging" ...done.
root@logger:~# rabbitmqctl set_permissions logging ".*" ".*" ".*"
Setting permissions for user "logging" in vhost "/" ...done.
root@logger:~# rabbitmqctl delete_user guest
Deleting user "guest" ...done.
The first thing to do is set up Logstash to get logs shipped over AMQP.
Client configuration
input {
  file {
    type => "syslog"
    path => [ "/var/log/messages", "/var/log/syslog", "/var/log/*.log" ]
  }
  file {
    type => "apache-access"
    path => "/var/log/nginx/access.log"
  }
}

output {
  amqp {
    host => "logger.acme.com"
    exchange_type => "fanout"
    name => "rawlogs"
    user => "logging"
    password => "mypassword"
  }
}
Server configuration
input {
  amqp {
    type => "all"
    host => "localhost"
    exchange => "rawlogs"
    name => "rawlogs_consumer"
    user => "logging"
    password => "mypassword"
  }
}

output {
  stdout { }
  gelf {
    facility => "logstash-gelf"
    host => '127.0.0.1'
  }
}
We leave the stdout output enabled just to check things are working; it's a good idea to disable it once everything is running. We essentially just use Logstash as a broker to get stuff from RabbitMQ into Graylog2 via GELF. Graylog2 does support AMQP directly, but there are some good reasons we do this - namely it doesn't support using AMQP in the same way that Logstash does.
Kickstart both of them the same way, assuming you stored the config as logstash.conf
# java -jar logstash-1.1.0-monolithic.jar agent -f logstash.conf
Next get Graylog2 going. Extract both the server and the web interface into /opt and configure the server. Copy graylog2.conf.example to /etc/graylog2.conf and make the relevant changes. You can use mongodb with or without authentication if it's not accessible externally, we're using authentication here. For setting up MongoDB authentication read more here, which has a bunch of other info on configuring Graylog2 which is worth reading.
syslog_listen_port = 514
syslog_protocol = udp

elasticsearch_url = http://localhost:9200/
elasticsearch_index_name = graylog2

force_syslog_rdns = false

mongodb_useauth = true
mongodb_user = graylog
mongodb_password = graylog
mongodb_host = localhost
#mongodb_replica_set = localhost:27017,localhost:27018,localhost:27019
mongodb_database = graylog2
mongodb_port = 27017

mq_batch_size = 4000
mq_poll_freq = 1
mq_max_size = 0

mongodb_max_connections = 100
mongodb_threads_allowed_to_block_multiplier = 5

use_gelf = true
gelf_listen_address = 0.0.0.0
gelf_listen_port = 12201

# AMQP
amqp_enabled = false
amqp_subscribed_queues = gqueue:gelf,squeue:syslog
amqp_host = localhost
amqp_port = 5672
amqp_username = guest
amqp_password = guest
amqp_virtualhost = /

forwarder_loggly_timeout = 3
You can start Graylog2 up now with 'bin/graylog2ctl start'.
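Once the server is up, a quick way to check the GELF listener is to throw a hand-made message at it. The following is only a rough sketch in Python, not part of the original setup: it assumes the GELF 1.0 field names (version, host, short_message, timestamp, level, facility), a zlib-compressed JSON payload over UDP, and the gelf_listen_port of 12201 from the config above. The function names build_gelf and send_gelf are made up for illustration.

```python
import json
import socket
import time
import zlib


def build_gelf(short_message, host, facility="smoke-test", level=6):
    """Build a zlib-compressed GELF 1.0 payload (field names assumed from the GELF spec)."""
    message = {
        "version": "1.0",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,  # syslog-style severity, 6 = informational
        "facility": facility,
    }
    return zlib.compress(json.dumps(message).encode("utf-8"))


def send_gelf(payload, server="127.0.0.1", port=12201):
    """Fire a single UDP datagram at the GELF listener; no delivery guarantee."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (server, port))
    finally:
        sock.close()
```

Calling send_gelf(build_gelf("hello from the smoke test", socket.gethostname())) on the log server should make a message show up in the web interface, if the assumptions above hold for your Graylog2 version.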
To get the web interface working, do the following
root@logger:/opt/graylog2-web-interface-0.9.6# gem install bundler
root@logger:/opt/graylog2-web-interface-0.9.6# bundle install
root@logger:/opt/graylog2-web-interface-0.9.6# script/rails server -e production -p 80
You should now have a running logging system. The next step is to go back and put the web interface behind Passenger and nginx, running as an unprivileged user.
Why we should cut Jessica some slack
I have no idea who Jessica Leandra is, but a lot of people seem angry at her right now for making a racist remark on Twitter. Every now and then people do disastrously silly things to their reputation on social networks, and I guess we all know who she is now - perhaps that's the point, perhaps in anger we're all capable of being stupid. I've never really had a reputation personally, at least not one that I took any notice of, so I don't know what that's like. Apparently what she said was "Just, well took on a on arrogant and disrespectful k***** inside Spar. Should have punched him, should have.". One thing I do hear all the time from women of all races is that African men are apparently really horrible to deal with.
The whole racism issue and especially use of the 'K' word has been spinning around SA for years now though. To make one thing clear, it's not an acceptable term to use in generalizing a race, and it's not even a term which is used exclusively by one race. I know this because I was only white for half my life. Yup, gotcha. My terribly pale skin and inability to dance, rap or play basketball well have for almost a century hidden what was once a very dark (see what I did there?) family secret that I only learned when I was about 16 - my great great great grandfather was actually an African slave in England. Learning this was a strange thing for me. I didn't grow up in apartheid but I lived the change away from it. We should all be honest though that a rapidly growing share of the people alive in South Africa did not live through apartheid, and yet the effects remain. What I did live with was people who took their cues from their parents, divided by race; most of us living as white people were raised by fairly racist parents. I recall the discussions around braais, and spending most of my time as a kid in garages involved in motorbike racing I probably heard the worst of it - one of the people there even ended up in prison for trying to bomb a taxi rank. It's been difficult for all of us; some wound up respecting the boundaries of political correctness and others didn't. I only truly learned what the former meant when my father, who seemed very angry about affirmative action, went to England along with many other South Africans - it was there that he traced his actual family, and to me it seemed he may have learned that he was actually what he had hated all this time.
My life changed after high school along with the rest of this country, and it became more and more obvious to me how often people would pull me aside and seemingly hope that I would agree with their racism, a lot like I've found drug addicts tend to seek validation of their choices. I will never forget one instance a few years ago in my favorite bakery. It was the time of the xenophobia riots in Durban, and the old Indian lady behind the counter leaned over as I was leaving and said "Just be careful out there, you know these ka***rs are at it again". I was shocked, and confused. Why would she think that it was okay to be so openly racist to a person you don't know just because they look white? After that experience something in my mind clicked, and I began to notice it all around me, with many of my friends and a massive percentage of people I worked with and got to know or even didn't know. How could this possibly be okay?
I don't know this Jessica person because I'm a programmer. I've never read FHM, and outside of heroes of science and technology I mostly only know about a few porn stars and politicians (which I guess are pretty similar). I guess that's really the point though: race divides us just like gender does, and sometimes hair color and respect for all things geeky. Jessica's run-in with whatever it was led her to anger of a different paradigm; neither instance was acceptable, but both were simultaneously a natural aspect of humanity and the reality we live with now, and we are all victims of a past that wasn't our own.
If you've never been cut off by a taxi and uttered or even for one microsecond thought a racist thing then go ahead and be mad, go ahead and be mad anyway if you like - but with such a complex problem, and so many hypocritical sinners out there, I feel it's unfair and disproportionate to destroy yet another person's life and career over a stupid mistake, because that will only validate people's hatred on either side of the argument. Was it acceptable? No. Is it forgivable? Yes.
Crane
There was a reason behind the waffling about Winch and Riak lately. Winch is just one part, the agent and collector, which is a framework for building an actual monitoring system capable of storing high-resolution data for lots of hosts. The more data points, the more scalability required of course. But at least this is getting somewhere, so I have an interface to dump stuff out for visualisation in Highcharts, yay.
The big problem however is that no matter how awesome your backend is, it's not feasible to store billions of datapoints without a serious amount of memory so the next step is rolling aggregation.
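To make the idea a bit more concrete, here is a rough Python sketch of one way rolling aggregation could work: older high-resolution points get averaged into coarser fixed windows so far fewer values need to be kept. This is an assumption about what the aggregation would look like, not Crane's actual code, and the rollup function and its (timestamp, value) input format are made up for illustration.

```python
from collections import defaultdict


def rollup(points, window):
    """Average (timestamp, value) points into fixed windows of `window` seconds.

    Returns a list of (window_start, mean_value) sorted by window start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of the window it falls in.
        buckets[int(ts // window) * window].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]
```

With 5 second samples and a 60 second window, twelve stored points collapse into one, and the same function can be applied again to the output to build hourly or daily series.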
The theme is something I found a while ago, if it's yours feel free to demand attribution, it's probably just temporary though.
There, I fixed/broke it
So Django has not been a friend of mine due to the lack of OO in views. I don't know what psychological defect of mine makes me desire the comfort of having objects to pass around instead of methods but regardless I decided I'd just fix it instead of continuing to try and push square pegs into the round holes in my brain.
from django.http import Http404


class GenericView(object):
    def __init__(self, **kwargs):
        # Store arbitrary keyword arguments as attributes on the view.
        for k, v in kwargs.items():
            setattr(self, k, v)

    def __call__(self, request, *args, **kwargs):
        # Use the last URL path segment to pick a render_<segment> method.
        path = request.path.rstrip('/').rsplit('/', 1)[-1]
        if path:
            method = 'render_' + path
            if method in dir(self):
                return getattr(self, method)(request)
            else:
                raise Http404
        return self.render_index(request)

    def render_index(self, request):
        raise Http404
Then URLs can be mapped as url(r'^stuff/', 'app.views.whatever_instance'), and it hands the next child down to the render_$child method.
I promptly realized this doesn't actually win me anything over methods, except lazier URL mapping in a sort of Rails way, and a slightly cleaner looking codebase I guess.
ICASA’s arguments lack basic logic
In defense of SA's comparatively high mobile call rates ICASA's Pieter Grootes recently stated, “The greater the geographic coverage you have in a nation, the greater the retail price”. I've heard stupid things before, especially from ICASA, but this one takes the cake due to the total lack of very fundamental economic logic.
Let's say I have a shop that sells cakes. If I open a new shop somewhere else that sells the same cakes, why would my prices need to go up? They wouldn't, I'm accessing more customers and selling more cakes - that was the entire reason for opening a new shop. Geographic coverage increases access to service so that people can continue making calls, and this earns the mobile operator more money. If increasing coverage wasn't in the financial interest of operators they wouldn't be doing it. Bigger networks cost more money, but they also make more money. That's why boutique stores are more expensive than chain stores. If ICASA are actually claiming the reverse, then something very fundamental is broken with the legislation that they are responsible for. Economies of scale Mr Grootes, bigger networks should be cheaper per bit, not more expensive.
Claiming that being able to rake in more consumers is an excuse for higher prices is the most ludicrous and hysterically ignorant thing I've heard in a very long time. ICASA's excuses for their own failures have started to become insulting, do they really think people are stupid enough to believe them?
Not content however with being downright illogical, they go on to shoot themselves in the foot: “In South Africa there are some significant import and excise duties on such equipment and that’s a factor that will affect your wholesale input cost, and therefore your retail prices.” Well now, who exactly is responsible for import duties? The government. The people that ICASA is a part of... Now, import duties are not ICASA's fault, nor do they have the power to change import legislation, but it's not like they're doing their job either, which should be to recommend that the DTI reduce import duties on equipment.
The grass is greener because it rains all the time
I'm a lucky person: if SA ever decides to go to war with Botswana, upset America, or something ridiculous, I can skip town and hide in the UK. There are countless occasions where people discover I'm a full-fledged UK citizen and also an SA one and ask me why I have not moved to the UK. I've always found this question strange. Yes, a big chunk of my family lives in the UK and I'm sure they're very happy there, but I certainly wouldn't be - I can barely survive more than a week in London. I think it's dark, cold, people aren't nice, and it feels like such a lonely claustrophobic life. Of course the excuses levied are quite annoying, and these are the most common.
SA is a 3rd world country
How long is it going to take before people ditch this stupid attitude? What part of SA compares to Senegal? Go live in a real third world country or shut up. Also, go do some reading, "The term Third World arose during the Cold War to define countries that remained non-aligned with either capitalism and NATO" - SA is a first world country people, there's a map and everything right there on Wikipedia for you. With a GDP per capita of over $10,000 we're not even considered a "developing country".
I genuinely want to punch people and hit them with bricks when they call SA a 3rd world country. Seriously, stop pretending you're some down-trodden victim of affirmative action living in a slum with no hope of survival.
SA has crime
So does everywhere else. When everywhere else solves crime, please feel free to send your suggestions to the SAPS - I'm sure they'd welcome it. Or better yet, why not join the police force. Still crime is a problem, but it has actually gotten a lot better over the last 15 years since people started complaining about it. If you don't feel safe, or have been a victim of crime, I sympathise. I too have been a victim of crime, but I choose not to let it get to me.
Then again, the UK has crime too - quite a bit of it.
London has public transport.
If you want to argue public transport, please use Germany - yes I would go live in Germany in an instant, it's a fantastic place, their public transport is reasonably priced and insanely on-time and the people are awesome. I love Germany.
But back to London, it has public transport because the roads are tiny and having everyone in a car would bring the place to a standstill. Still if you've actually been to London (like I have, on several occasions) you might note that the public transport is FAR from cheap, it's also not very reliable. I constantly hear about people working from home because the trains weren't running or there were big delays or some such. If you're using the underground every day, you're looking at at least £100 per month. Yes, it's more expensive to run a car in SA and traffic isn't exactly stellar in Gauteng either, but really given the choice I'd take a car. The underground is hot, noisy, it's not that reliable, terrorists bomb it, and it's insanely busy during rush hour so you can quite possibly be delayed just by the fact that there wasn't enough space left on the train. Give me a car any day - except in London though, if you want to drive a car around you have to pay £8 per day just in congestion charge never-mind MOT and insurance etc, and the fact that their petrol is around £1.4 per liter - yes that's R17 vs our R12. Most notably though lots of places that aren't London don't have that much public transport, if you head over to somewhere in the British midlands it's pretty much exactly like SA, just wetter.
Salaries are higher
Yeah, they are, slightly, and they need to be, or you'd actually die homeless. Personally, in my field I earn the same in Gauteng (if not slightly more after tax and expenses) as I would in London. Let me do the maths for you.
Let's assume someone earns R500,000 a year in SA, and you could get a job for £70,000 in London (although after the recession this is less likely).
In SA you will be taxed at a PAYE rate of 24.61% and pay UIF of 1%, so your net pay is R371,950 per year. Throw in some Discovery medical at around R1,600 a month to remove the aspect of health care and it's more like R352,750 a year.
In London you will be taxed at a rate of 25.7%, and fork over £4781 a year for National insurance. After all this you clear £47,208.96.
Things are looking great, except let's find a place to live. I live in a 200 square-meter apartment in a pompous gated suburb near the heart of Sandton. It would be comparable to Zone 3 in London, except substantially better looking than anything there... All that luxury costs me a whole R8500 a month. You want the same place anywhere remotely near London? £600 a week please!
So where are we? After a year of rent (R8,500 x 12 = R102,000) the guy living in SA still has R250,750 of his R352,750 to play with, buying cars and food and whatever. The sucker in London, paying £600 x 52 = £31,200 in rent, has about £16,009 left. Whatever way you cut it that's a lot less, in a place that's more expensive!
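For anyone who wants to poke at the numbers, here's the sum as a small Python sketch. It takes the stated inputs at face value and makes a few assumptions of its own: medical aid is deducted on the SA side, rent is counted for 12 months in SA and 52 weeks in London, and the London take-home is simply the quoted £47,208.96 rather than recomputed from the rounded tax rate. Change any of those and the remainders shift accordingly.

```python
from decimal import Decimal


def sa_net(gross, paye_rate, uif_rate):
    """Net annual pay after PAYE and UIF, both given as fractions of gross."""
    return gross - gross * paye_rate - gross * uif_rate


# South Africa, in Rand (inputs as stated in the post)
sa_take_home = sa_net(Decimal("500000"), Decimal("0.2461"), Decimal("0.01"))
sa_after_medical = sa_take_home - Decimal("1600") * 12
sa_left = sa_after_medical - Decimal("8500") * 12   # 12 months of rent

# London, in Pounds (take-home figure taken directly from the post)
uk_take_home = Decimal("47208.96")
uk_left = uk_take_home - Decimal("600") * 52        # assumption: 52 weeks of rent

print("SA left after rent: R%s" % sa_left)
print("UK left after rent: £%s" % uk_left)
```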
So if you want to go live in London, I wish you all the best of luck but take it from me the life isn't any better. If you do live in London, I bet you wish you were sitting in a warm sunny office staring at trees with a fantastic cheap restaurant below right about now like I am, don't you.
Email absolutely is broken
I read some surprising thoughts in a post Email is not broken: It's a framework, not an application. I'm not sure I understand any of the inference though. For starters, Email consists of several components. There's SMTP - the Internet protocol which moves messages between clients and servers, and between other servers. MIME - the standard by which different pieces of data are encoded and decoded by clients and/or servers. A mailbox protocol, POP3 and IMAP are good examples of very bad ideas so most large systems deploy their own (HTTP these days, used to be stuff like MAPI). SMTP and MIME however are not "frameworks", they are protocols and standards which have reached their life span. In my 10 years of operating email servers there have been all sorts of common questions from customers, "Why can't I send a 100MByte attachment", "Please stop people from forging my email address", "I keep getting emails about Viagra", "Why do emails from my aunt always appear cut off or broken". All of these questions are simultaneously frustrating in both their naivety but also their absolute validity. Email, or rather all the protocols that it is made up of, is horribly and fundamentally broken as well as unfriendly to users.
A bit of history about email. Back in the day (and I mean, really back in the day of punch cards and such) a vast majority of systems were character driven interfaces and could only support 7 bits per character (pfft, who needs more than 128 characters, really, that's more than the keyboard has), so the SMTP protocol only supports data in the form of 7-bit ASCII. This has never changed (read: been fixed), and theoretically it didn't need to. This made SMTP useless for sending anything other than very simple plain text messages - hence the name.
That's where MIME came in, it provides a standard by which content of any type can be reduced to 7bit through Base64 encoding, and it's flexible enough to do quite a lot of cool things - the most important of which is to support character sets larger than 7bits for people in China or Russia for example.
The problem is that MIME isn't really adhered to in many ways, so clients tend to behave differently and/or erratically when it comes to both parsing and generating MIME. I don't blame the clients for this; MIME is essentially a client-to-client protocol and neglects the servers along the way - and what people want to do with these servers now. I don't blame MIME or SMTP for this either, they were written in a vastly different world with very different problems to solve. Blaming Mr Borenstein for that would be as useful as blaming Casey Cowell for the fact that v56 modems don't support ADSL, or the creators of IPv4 for the fact that the world now has more people online. The problem is that MIME provides a standard for transmitting different media types but omits how they should be rendered in many contexts. Most clients will send a plain text version of an email along with whatever other MIME encoded content, but there's no guarantee that the plain text version will look how you expect, or even that the person will read either the plain text or rich text version. What happens if you send a mail from Outlook to someone's Android phone? It might look the same, who knows. This is made infinitely worse, not by the MIME standard, but by the flaky HTML standards of old, and now even CSS is being embedded into mails. Outlook was making use of IE to render broken HTML, the rest of the world wasn't. Hilarity ensues.
The fact of the matter is that it's 30+ years later, and it's time to recognize obsoleteness. For starters, SMTP has no trust mechanism so we have mountains of unsolicited email going around. The biggest problem however is that most people share files through email, and they need features like auto-responders and stationery because that's what marketing people and lawyers tell them they need. SMTP is completely inefficient for this purpose: because it doesn't support binary, any file transferred through SMTP has to be Base64 encoded, which costs about a third more bits than a binary-safe transfer protocol would need, and up to double for tiny payloads. This is easy to demonstrate:
>>> import base64
>>> base64.b64encode(b':)')
b'Oik='
So that's 4 bytes instead of 2 just to send a smiley face (padding makes tiny payloads the worst case; larger files grow by about a third) [Fun fact: In Outlook a smiley face is encoded as a Wingdings character which consumes a ton of formatting that isn't supported by other clients and often appears as 'J']. Clearly SMTP is the worst way possible to transmit non-ASCII information around the internet.
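The same overhead can be measured for bigger payloads. This is just a quick sketch using Python's standard base64 module; base64_overhead is a made-up helper, and it ignores the line breaks and headers that MIME adds on top, so real messages come out slightly worse.

```python
import base64


def base64_overhead(size):
    """Ratio of Base64-encoded length to raw length for `size` zero bytes."""
    raw = bytes(size)
    return len(base64.b64encode(raw)) / len(raw)


# Tiny payloads are dominated by padding; larger ones settle near 4/3.
for size in (2, 3, 1000, 1000000):
    print(size, round(base64_overhead(size), 3))
```

Every 3 bytes of input become 4 characters of output, which is where the roughly 33% penalty on attachments comes from.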
Then we have the simple issue of security. There have been many ways that people, email clients and some service providers have tried to create ways to send SMTP messages securely around the internet. The first of these is PGP or GPG, which entails sharing a public key with the recipient and then sending them encrypted content which can be decrypted with their private key. Mostly PGP was used to validate the sender (a feature that SMTP lacks entirely since the sender can be forged quite convincingly numerous ways) because people prefer security to be more implicit and less invasive, people are far too lazy to learn about asymmetric encryption and there's large voids in how keys should be shared which leads to inherent insecurity anyway. The other thing is TLS for SMTP, which don't get me wrong is certainly useful in limited ways - you *do* want your road-warrior clients who use open wifi hotspots to make use of TLS to send email to your corporate server, however security through the SMTP protocol can't be guaranteed.
Sure you can only accept mail via TLS, but you'd fail to receive lots of mail from systems that don't support it or even from systems with older SSL libraries, similarly if you were to refuse to send mail to people who don't support TLS. The issue is even if you accepted mail via TLS or sent it via TLS there is no way to guarantee how it is relayed later - for all you know it could be printed on paper and handed to the recipient, in which case the secure connection at any point was a total waste of time.
But there are even more fundamental layers that email is broken on, my biggest gripe being MX records. An MX record is the DNS record which tells one person's SMTP server where the other person's SMTP server is. MX records can list multiple servers at different priorities, but the priority actually selected is not guaranteed, and there are many MX configurations which have completely undefined behavior under the RFCs, which is why many people make "rookie" mistakes like putting an IP address into an MX record, or a CNAME into an MX record. Greylisting makes this even more complicated: if the first server issues a temporary 4xx rejection in order to make the remote sending server back off for a while, the sender will actually just hit the next MX record very quickly. The result is that greylisting causes far longer delays than the intended one, since after rejection from all servers the mail in the remote queue is at retry n, where n is the number of MX records. A 300 second greylist delay could become a 3 hour remote queue delay and a total waste of resources. On top of that many people don't understand the TTLs that come with DNS, so they go changing their MX records without reducing their TTL beforehand and they get angry when they lose emails. This is even worse when you see people using dynamic DNS - the guy who got your IP during the delay before you updated your dynamic DNS could happily be receiving your email while no one else is the wiser!
In the history of internet protocols, no other has needed its own DNS record type in order to coordinate its operation. Jabber and a few others try using SRV records, but most agree it's a horrible idea. Even more horrible are ideas like SPF, which attempts to fix the sender validation that SMTP lacks. The problem with SPF is that it's all or nothing. If half the internet has no SPF records then no one else should bother, because not having an SPF record doesn't invalidate the sender, and having an SPF record doesn't guarantee that anyone else's SMTP server cares, so they'll happily forward on forged email for your domain. This is a minor gripe, but I think valid.
So to summarise:
- MIME provides a standard for message formatting and binary encoding, it does not guarantee how your mail will be rendered on any other client, or the capability of that client (think smartphones or even dumb phones).
- SMTP solves far too many old problems which make it very inefficient in the modern world
- SMTP is actually very insecure, while pretending not to be.
- SMTP can be forged
- POP3 and IMAP are hideous (seriously, just take my word as someone who has implemented clients for both protocols)
- Let's not even talk about even more horrible things like UUCP
So in conclusion I would say no, clients can't fix the problems with email. It's time to stop trying to hack the "framework" into working and just make a better one. The logistics of that however make it as unlikely as people throwing their fax machines and printers away.
Bootstrapping EC2 instances into Puppet
So the whole point of things like Puppet is to be able to automate deploying machines in environments like Amazon EC2, but there are a few bootstrapping issues that get in the way of this. The biggest for me is the fact that Puppet authorizes machines based on their hostname, and conflicting with that is the EC2 idea that the hostname is apparently irrelevant. Some people are okay with that perhaps, and using facts along with mcollective is a decent way of orchestrating things too, such that you don't need to care about the machine's hostname in the long run.
That irritates me though, so I went against the seeming conventions and did something else. That was to use the instance userdata as a boilerplate to configure a machine internally to connect to puppet. This requires a simple AMI that contains the following /root/bootstrap.sh script and calls it from /etc/rc.local.
#!/bin/bash

# Only run once: the marker file is created at the end of a successful run.
if [ -f /root/.initialised ]; then
    exit 0
fi

# Pull the FQDN from a "hostname=..." line in the instance user-data.
FQDN=`/usr/bin/curl -s http://169.254.169.254/latest/user-data | grep hostname | sed 's/.*=//' | sed 's/ //'`
HOSTNAME=`echo ${FQDN} | awk -F"." '{print $1}'`

if [ "$FQDN" == "" ]; then
    echo "No hostname found in user metadata"
    exit 0
fi

echo $FQDN > /etc/hostname
hostname $FQDN

# Point the agent at the Puppet master.
echo "server=puppet.imcol.in" >> /etc/puppet/puppet.conf
echo "listen=true" >> /etc/puppet/puppet.conf

echo "127.0.0.1 localhost $HOSTNAME $FQDN" > /etc/hosts

apt-get -y install puppet

# Allow the init script to start the agent, then start it.
sed -i /etc/default/puppet -e 's/START=no/START=yes/'
/etc/init.d/puppet start

# Reset rc.local so the bootstrap is not called again.
echo "#!/bin/sh -e" > /etc/rc.local
echo "exit 0" >> /etc/rc.local

touch /root/.initialised
This pulls the userdata for the instance and looks for the hostname and reconfigures the server with a sane hostname, then starts puppet - yay. Now on the puppet master you can set auto-signing for your domain and then bringing up a new server is just a case of launching a new instance with the appropriate hostname for the service.
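If the grep and sed pipeline is hard to follow, the hostname extraction can be sketched in Python like this. It is only an approximation of what the shell does, and it assumes the user-data contains a line of the form hostname=web1.example.com (the exact format isn't spelled out above); parse_hostname is a made-up name.

```python
def parse_hostname(user_data):
    """Return (fqdn, short_name) from user-data text, or None if no hostname is set.

    Roughly mirrors: grep hostname | sed 's/.*=//' | sed 's/ //'
    """
    for line in user_data.splitlines():
        if "hostname" in line and "=" in line:
            # The sed pattern is greedy, so take everything after the last '='.
            fqdn = line.rsplit("=", 1)[1].strip()
            if fqdn:
                return fqdn, fqdn.split(".")[0]
    return None
```

Launching an instance with user-data of hostname=web1.example.com would then give a machine called web1 that presents itself to the Puppet master as web1.example.com.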
One other curious thing though is that other information out there omits the fact that you can specify a server in the puppetd configuration, instead they're all slowly building themselves into a corner by manually creating records for 'puppet' in /etc/hosts. I'm not sure why you'd want to manage 1000 hosts files when the IP of your Puppet master changes.
A bit about Winch
If you noticed I'm putting a lot into Riak at the moment, that's because I figure it's a great place to store monitoring data. Currently I'm working on a little monitoring concept called Winch, which aims to be a highly scalable monitoring platform capable of storing information at higher resolutions than existing systems. A few similar ideas and systems already exist, but if man were happy stopping at the wheel we wouldn't have rocket ships.
First of all, Winch doesn't do any form of server-side polling (logically it will need to at some point for external checks, but that's not the important part). Instead, the client agent has its own scheduler and SQLite message queue. The scheduler retrieves the check schedule from the server, then runs plugins based on the configured frequency. Check results become JSON documents which are shoved into a SQLite table, which is then cleared out by pushing data to the server. The benefit of this is that check data isn't lost if the server goes away, with the caveat that if the server is away for a long time we might kill the client with backlog and cause a thundering herd when the server returns (this probably needs a push back-off and expiry to keep things sane later).
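The client-side queue can be sketched roughly like this. This is not Winch's actual code: the `ResultQueue` class, the table layout and the `send` callback are all assumptions made for illustration, using only Python's standard `sqlite3` and `json` modules.

```python
import json
import sqlite3


class ResultQueue:
    """Durable local queue of check results, held in SQLite until delivered."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, doc TEXT NOT NULL)"
        )
        self.db.commit()

    def put(self, result):
        """Store one check result as a JSON document."""
        self.db.execute(
            "INSERT INTO results (doc) VALUES (?)", (json.dumps(result),)
        )
        self.db.commit()

    def flush(self, send, batch_size=100):
        """Push queued results oldest first; delete only what `send` accepted."""
        rows = self.db.execute(
            "SELECT id, doc FROM results ORDER BY id LIMIT ?", (batch_size,)
        ).fetchall()
        sent = 0
        for row_id, doc in rows:
            try:
                send(json.loads(doc))
            except OSError:
                break  # server unreachable: keep the rest for the next attempt
            self.db.execute("DELETE FROM results WHERE id = ?", (row_id,))
            sent += 1
        self.db.commit()
        return sent

    def pending(self):
        """Number of results still waiting to be delivered."""
        return self.db.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

Capping each flush with `batch_size` is one simple way to soften the thundering herd when the server comes back; a proper back-off and expiry policy would still be needed on top.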
Once data hits the server, each check result gets put into a bucket for that host and check combination, and a trigger function gets called for that check.
This is where things borrow slightly from Zabbix, but return some sanity and reduce complexity by giving triggers a 1:1 relationship with a check, much like Nagios does.
Assuming we have some load value check producing data points on a 5-second interval, this gives us the following check data in a Riak bucket.
```python
{u'target': u'd3c36ca27d8511e1b4830800278663d1', u'plugin_time': 0.029787063598632812, u'plugin_name': u'load', u'load1': 1.87, u'plugin_calltime': 1333569143.757328, u'load3': 1.9, u'load5': 1.31}
{u'target': u'd3c36ca27d8511e1b4830800278663d1', u'plugin_time': 0.029787063598632812, u'plugin_name': u'load', u'load1': 1.87, u'plugin_calltime': 1333569143.757328, u'load3': 1.9, u'load5': 1.31}
{u'target': u'd3c36ca27d8511e1b4830800278663d1', u'plugin_time': 0.0676109790802002, u'plugin_name': u'load', u'load1': 1.85, u'plugin_calltime': 1333569138.756273, u'load3': 1.89, u'load5': 1.31}
{u'target': u'd3c36ca27d8511e1b4830800278663d1', u'plugin_time': 0.04973483085632324, u'plugin_name': u'load', u'load1': 1.93, u'plugin_calltime': 1333569133.755937, u'load3': 1.91, u'load5': 1.31}
{u'target': u'd3c36ca27d8511e1b4830800278663d1', u'plugin_time': 0.04573202133178711, u'plugin_name': u'load', u'load1': 1.92, u'plugin_calltime': 1333569128.762157, u'load3': 1.91, u'load5': 1.31}
{u'target': u'd3c36ca27d8511e1b4830800278663d1', u'plugin_time': 0.040303945541381836, u'plugin_name': u'load', u'load1': 1.91, u'plugin_calltime': 1333569123.755218, u'load3': 1.91, u'load5': 1.3}
```
The nice thing about dropping the expectation that checks complete exactly on schedule is that we can work with fuzzy windows instead. The Winch scheduler records the epoch time the check plugin was called, and the time it took to complete. This gives enough data to profile the efficiency of a plugin and also estimate the time window that the datapoint refers to.
Currently the trigger function language is a restricted Python eval which has 3 functions: average, last and range. 'last' retrieves the last n values by creating a time window based on the known check frequency. Because there is no concern about stacking late checks (a side effect of trying to centralise polling schedules and killing late jobs), we know we will always get within 1 datapoint of the desired number of points, so the 'last' method simply adds an extra time window and then cuts excess points later. Since points are retrieved by time, they are guaranteed to fall into the most recent window, so stale data cannot trigger false alarms.
The benefit of keeping constant data available is that instead of alerting on instantaneous values, it's possible to go back to old data and compare it with current data, basing our alert decisions on a changing trend while knowing that the age of the data is consistent. The common issue with monitoring load values is that servers have different load profiles, so rather than picking a value that might be high for some systems but normal for others, we can compare the current values to what has been seen previously and detect abnormalities.
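As a small illustration of what those two fields allow, the sketch below derives the time window a datapoint covers and how far consecutive checks drifted from the nominal interval. The helper names `check_window` and `schedule_drift` are invented for this example; only the `plugin_calltime` and `plugin_time` keys come from the data above.

```python
def check_window(result):
    """Return (start, end) in epoch seconds covered by one check result."""
    start = result["plugin_calltime"]
    return start, start + result["plugin_time"]


def schedule_drift(results, frequency):
    """Seconds each check started late (+) or early (-) after the previous one."""
    calls = sorted(r["plugin_calltime"] for r in results)
    return [later - earlier - frequency
            for earlier, later in zip(calls, calls[1:])]


if __name__ == "__main__":
    # Two of the load datapoints shown above, on a 5 second schedule.
    samples = [
        {"plugin_calltime": 1333569138.756273, "plugin_time": 0.0676109790802002},
        {"plugin_calltime": 1333569133.755937, "plugin_time": 0.04973483085632324},
    ]
    print(check_window(samples[0]))
    print(schedule_drift(samples, 5))
```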
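The windowing idea behind 'last' might look something like the following sketch. It works on an in-memory list of `(plugin_calltime, value)` pairs as a stand-in for the real index query, and the signature (including the explicit `frequency` and `now` arguments) is an assumption for illustration, not Winch's real interface.

```python
def last(points, n, frequency, now):
    """Return the newest `n` values, newest first.

    `points` is a list of (plugin_calltime, value) pairs, standing in for
    whatever the storage query returns.
    """
    # Ask for one extra interval so a slightly late check still falls inside.
    window_start = now - (n + 1) * frequency
    recent = [(t, v) for t, v in points if window_start <= t <= now]
    recent.sort(reverse=True)          # newest first
    return [v for _, v in recent[:n]]  # cut any excess point


if __name__ == "__main__":
    # Hypothetical datapoints taken every 5 seconds.
    points = [(999.0, 1.87), (994.0, 1.85), (989.0, 1.93),
              (984.0, 1.92), (979.0, 1.91)]
    print(last(points, 3, 5, now=1000.0))
```

Because the window is anchored to the current time, a host that stopped reporting simply returns fewer (or no) points rather than old values.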
```python
average(last('load1', 3)) > 10*average(last('load1', 100)[5:])
```
This takes the last 3 load values and triggers if their average is more than 10 times the average of the values between the last 5 and the last 100.
This is naturally a slower process than plugins setting their own alert state, but the above trigger, which makes two index calls to Riak for 103 values, runs in 300ms on a single-core VM with no clustering, which is reasonable.
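To show how such an expression could be evaluated, here is a minimal sketch using in-memory data. It assumes values are ordered newest first (so `[5:]` skips the 5 most recent), leaves out `range`, and the `evaluate_trigger` helper is hypothetical. Emptying `__builtins__` narrows what the expression can reach, but should not be mistaken for a real sandbox.

```python
def average(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def evaluate_trigger(expression, series):
    """Evaluate a trigger against `series`: field name -> values, newest first."""

    def last(field, n):
        return series[field][:n]

    # Expose only the trigger helpers to the expression.
    allowed = {"__builtins__": {}, "average": average, "last": last}
    return bool(eval(expression, allowed))


if __name__ == "__main__":
    trigger = "average(last('load1', 3)) > 10*average(last('load1', 100)[5:])"
    # Hypothetical history: a sudden spike on top of a quiet baseline.
    spike = {"load1": [30.0, 28.0, 32.0] + [1.5] * 97}
    calm = {"load1": [1.6] * 100}
    print(evaluate_trigger(trigger, spike))
    print(evaluate_trigger(trigger, calm))
```

The same threshold expression works on a busy server and an idle one, since the baseline is whatever that host has recently reported.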

