Analyzing Popular Topics In My Twitter Timeline using Apache Spark

Word cloud of Twitter hashtags
Most popular Twitter topics, generated using Apache Spark and

Over the last weeks I’ve dived into data analysis using Apache Spark. Spark is a framework for efficient, distributed analysis of data, built on the Hadoop platform but with much more flexibility than classic Hadoop MapReduce. To showcase some of the functionality I’ll walk you through an analysis of Twitter data. The code is available as an IPython Notebook on Github.

The question I want to answer using Spark is: What topics are people currently tweeting about? The people are in this case the ones I follow on Twitter, at the moment approx. 600 Twitter users. They represent a diverse set of interests mirroring the topics I’m interested in, such as data analysis, machine learning, networking technology, infrastructure, motor sports and so on. By extracting the hashtags they’ve used in tweets the last week and do a standard word count I’ll generate a list of the most popular topics right now.

The amount of functionality for data analysis in Spark is impressive. Spark features a long list of available transformations and actions, such as map, filter, reduce, several types of joins, cogroup, sum, union, intersect and so on. In addition, Spark has a machine learning library with a growing number of models and algorithms. For instance does Spark MLlib include everything needed to do the Linear Regression example I did on AWS in my previous blog post. Spark also comes with a Streaming component where batch analysis pipelines easily can be set up to run as realtime analysis jobs instead. Compared to classic Hadoop MapReduce Spark is not only more flexible, but also much faster thanks to the in-memory based analysis.

To do the Twitter analysis, I first fetched about 24000 tweets from the Twitter API using a Python module called Tweepy:

Each tweet was saved to a local MongoDB instance for persistence. The loop first checks the database to see if that user has been processed already, to save time if the loop has to be run several times. Due to rate limiting of the Twitter API it took about 2 hours to download the dataset. By the way, the term “friends” is the word Twitter uses to reference the list of users that a user follows.

The code snippet above depends on a valid, authorized API session with the Twitter API and an established connection to MongoDB. See the IPython Notebook for the necessary code to establish those connections. Of the dependencies the Python modules “tweepy” and “pymongo” need to be installed, preferably using pip to get the latest versions.

With the tweets saved in MongoDB, we are ready to start doing some filtering and analysis on them. First, the set of tweets need to be loaded into Spark and filtered:

In the code snippet above I use sc.parallelize() to load a Python list into Spark, but I could just as easily have used sc.textfile() to load data from a file on disk or sc.newAPIHadoopFile() to load a file from HDFS. Spark also supports use of Hadoop connectors to load data directly from other systems such as MongoDB, but that connector unfortunately does not support PySpark yet. In this case the dataset fits in memory of the Python process so I can use sc.parallelize() to load it into Spark, but if I’d like to run the analysis on a longer timespan than one week that would not be feasible. To see how the MongoDB connector can be used with Python, check out this example code by @notdirkhesse which he demonstrated as part of his excellent Spark talk in June.

“sc” is the SparkContext object, which is the object used to communicate with the Spark API from Python. I’m using a Vagrant box with Spark set up and “sc” initialized automatically, which was provided as part of the very interesting Spark MOOCs CS100 and CS190 (BerkeleyX Big Data X-Series). The SparkContext can be initialized to use remote clusters running on EC2 or Databricks instead of a local Vagrant box, which is how you’d scale out the computations.

Spark has a concept of RDDs, Resilient Distributed Datasets. RDDs represent an entire dataset regardless of how it is distributed around on the cluster of nodes. RDDs are immutable, so a transformation on a RDD returns a new RDD with the results. The last two lines of the code snippet above are transformations to filter() the dataset. An important point to note about Spark is that all transformations are lazily evaluated, meaning they are not computed until an action is called on the resulting RDD. The two filter statements are only recorded by Spark so that it knows how to generate the resulting RDDs when needed.

Let’s inspect the filters in a bit more detail:

tweetsWithTagsRDD = allTweetsRDD.filter(lambda t: len(t['entities']['hashtags']) > 0)

The first filter transformation is called on allTweetsRDD, which is the RDD that represents the entire dataset of tweets. For each of the tweets in allTweetsRDD, the lambda expression is evaluated. Only those tweets where the expression equals True is returned to be included in tweetsWithTagsRDD. All other tweets are silently discarded.

filteredTweetsRDD = tweetsWithTagsRDD.filter(lambda t: time.mktime(parser.parse(t['created_at']).timetuple()) > limit_unixtime)

The second filter transformation is a bit more complex due to the datetime calculations, but follows the same pattern as the first. It is called on tweetsWithTagsRDD, the results of the first transformation, and checks if the tweet timestamp in the “created_at” field is recent enough to be within the time window I defined (one week). The tweet timestamp is parsed using python-dateutil, converted to unixtime and compared to the precomputed limit.

For those of you who are already acquainted with Spark, the following syntax might make more sense:

filteredTweetsRDD = (allTweetsRDD
                     .filter(lambda t: len(t['entities']['hashtags']) > 0)
                     .filter(lambda t: time.mktime(parser.parse(t['created_at']).timetuple()) > limit_unixtime)

The inspiration from Functional Programming in Sparks programming model is apparent here, with enclosing parentheses around the entire statement in addition to the use of lambda functions. The resulting filteredTweetsRDD is the same as before. However, by assigning a variable name to the results of each filter transformation, it’s easy to compute counts:

tweetCount = allTweetsRDD.count()
withTagsCount = tweetsWithTagsRDD.count()
filteredCount = filteredTweetsRDD.count()

count() is an example of an action in Spark, so when I execute these statements the filter transformations above are also computed. The counts revealed the following about my dataset:

  • Total number of tweets: 24271
  • Tweets filtered away due to no hashtags: 17150
  • Of the tweets who had hashtags, 4665 where too old
  • Resulting set of tweets to analyze: 2456

Now we’re ready to do the data analysis part! With a filtered set of 2456 tweets in filteredTweetsRDD, I proceed to extract all hashtags and do a word count to find the most popular tags:

What’s happening here is that I’m creating a new Pair RDD consisting of tuples of (hashtag, count). The first step is to extract all hashtags with a flatMap(), and remember that every tweet can contain a list of multiple tags.

filteredTweetsRDD.flatMap(lambda tweet: [hashtag['text'].lower() \
    for hashtag in tweet['entities']['hashtags']])

A flatMap() transformation is similar to a map(), which passes each element of a RDD through a user-supplied function. In contrast to map, flatMap ensures that the result is a list instead of a nested datastructure – like a list of lists for instance. Since the analysis I’m doing doesn’t care which tweet has which hashtags, a simple list is sufficient. The lambda function does a list comprehension to extract the “text” field of each hashtag in the data structure and lowercase it. The data structure for tweets looks like this:

{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Sun Jul 12 19:29:09 +0000 2015',
 u'entities': {u'hashtags': [{u'indices': [75, 83], 
                              u'text': u'TurnAMC'},
                             {u'indices': [139, 140], 
                              u'text': u'RenewTURN'}],
               u'symbols': [],
               u'urls': [],

So the result of the lambda function on this tweet would be:

['turnamc', 'renewturn']

After the flatmap(), a standard word count using map() and reduceByKey() follows:

.map(lambda tag: (tag, 1))
.reduceByKey(lambda a, b: a + b)

A word count is the “Hello World”-equivalent for Spark. First, each hashtag is transformed to a key-value tuple of (hashtag, 1). Second, all tuples with the same key are reduced using the lambda function, which takes two counts and returns the sum. Spark runs both map() and reduceByKey() in parallel on the data partition residing on each worker node in a cluster, before the results of the local reduceByKey() are shuffled so that all values belonging to a key is processed by one worker. This behaviour mimics the use of a Combiner in classic Hadoop MapReduce. Since both map() and reduceByKey() are transformations, the result is a new RDD.

To actually perform the computations and get results, I call the action takeOrdered() with a cursom sort function to get the top 20 hashtags by count. The sort function simply orders key-value pairs descending by value.

The 20 most used hashtags in my dataset turned out to be:

[(u'bigdata', 114),
 (u'openstack', 92),
 (u'gophercon', 71),
 (u'machinelearning', 68),
 (u'sdn', 66),
 (u'datascience', 58),
 (u'docker', 56),
 (u'dtm', 46),
 (u'audisport', 44),
 (u'dtmzandvoort', 42),
 (u'hpc', 40),
 (u'welcomechallenges', 38),
 (u'devops', 37),
 (u'analytics', 36),
 (u'awssummit', 36),
 (u'infosec', 33),
 (u'security', 32),
 (u'openstacknow', 29),
 (u'renewturn', 29),
 (u'mobil1scgp', 28)]

In this list it’s easy to recognize several of the interests I mentioned earlier. Big Data is the top hashtag, which together with Machine Learning and Data Science make up a significant portion of the interesting tweets I see in my Twitter timeline. OpenStack is another top hashtag, which is a natural topic given my current job in infrastructure. SDN is a closely related topic and an integrated part of the OpenStack scene. Docker is taking over in the infrastructure world and the DevOps mindset that follows with it is also a popular topic.

What’s interesting to see is that conferences spark a lot of engagement on Twitter. Both GopherCon and AWS Summit turn up in the top 20 list since they took place during the week of tweets I analyzed. The same goes for motor sports (hashtags DTM Zandvoort, Audi Sport, Welcome Challenges), although in that case it’s the professional teams, in contrast to conference goers, that make sure their Twitter followers are constantly updated on standings and news.

As I’m sure you’ve noticed, the word cloud at the beginning of this blog post is generated from the list of hashtags in the dataset and their counts. To finish off the analysis I also computed the average number of hashtags per tweet that had at least one hashtag:

# Count the number of hashtags used
totalHashtags = (key, value): value) \
                         .reduce(lambda a, b: a + b)

# Compute average number of hashtags per tweet
print('A total of {} hashtags gives an average number of ' +
      'tags per tweet at {}.'.format(totalHashtags, 
      round(totalHashtags/float(filteredTweetsRDD.count()), 2)))

Here I do another map + reduce, but this time the map function extracts the count for each hashtag and the reduce function sums it all up. It is very easy to build such pipelines of transformations to get the desired results. The speed and flexibility of Spark lowers the entry point and invites the user to do more such computations. In closing, here are the numbers:

  • A total of 4030 hashtags analyzed
  • An average number of tags per tweet at 1.64

~ Arne ~


Using Amazon Machine Learning to Predict the Weather

Amazon recently launched their Machine Learning service, so I thought I’d take it for a spin. Machine Learning (ML) is all about predicting future data based on patterns in existing data. As an experiment I wanted to see if machine learning would be able to predict the weather of tomorrow based on weather observations. Weather systems travel large distances on a time scale of hours and days, so recent weather observations from around the country can be used to predict the future weather of one specific site. Meteorological institutes do this every day by running complex weather models on hundreds of nodes in large HPC clusters. I don’t expect machine learning to produce quite as good results as those models do, but thought it would be fun to see how close ML could get.

Weather Map
Weather forecast from, delivered by the Norwegian Meteorological Institute and the NRK

The Amazon Machine Learning service makes it easy to get started and reduces the time it takes to get actionable insights from data. The service comes with tutorials, developer guides, a very useful explanation of machine learning concepts and enough tips to guide anyone through their first steps in the world of machine learning. There is a sample dataset you can use to create your first prediction model but if you want, you can follow along my journey in this post with my dataset instead. The source code to generate it is on Github and all you need to generate CSV files with weather observations is a free API key from

Defining the use case and dataset

Before diving into coding and machine learning, it’s important to define the use case as clearly as possible. To test whether Machine Learning is a viable approach to weather forecasting is the overall goal. To test this, I choose to predict the temperature tomorrow at 12:00 UTC in Oslo, the capital of Norway. The dataset I’ve chosen is weather observations from five cities in Norway, scattered around the southern half of the country. The weather in Oslo usually comes from the west, so I include observations from cities like Stavanger and Bergen in the dataset.

The layout of the dataset is important. Amazon Machine Learning treats every line in the dataset (CSV file) as a separate record and processes them randomly. For each record, it tries to predict the target value (temperature in Oslo the next day) from the variables present in that record. This means that you can’t rely on any connections between records, for example that the temperature measured at 10 AM may be similar to the temperature at 11 AM.

To create a dataset with enough data in each record to be able to predict the target value, I append all weather observations with the same timestamp, regardless of location, to the same record. This means that for any given timestamp there will for instance be five temperatures, five wind measurements and so on. To be able to distinguish between different cities in the dataset I named each column (each variable) with the first letter of the city name, forming variable names like “o_tempm” when the original observational data had a variable “tempm” containing the temperature for Oslo.

Creating the training dataset

Machine Learning works by creating a model from a training dataset where the target value to predict is already known. Since I want to predict a numerical value, Amazon ML defaults to a linear regression model. That is, it tries to build a formula which can output the target value, using individual weights for each variable in a record that tells the model how that variable is related to the target value. Some variables get weight zero, meaning they are not related to the target value at all, and others get positive weights between 0 and 1. To be able to determine weights for variables, there must be a sufficient amount of training data.

To create a sufficiently large training dataset, I needed weather observations for some time, at least 14 days. Fortunately, has a JSON API that is really easy to use, and their history endpoint can provide weather observations for both the current date and dates back in time. I use that to collect observations for the last two weeks for all five cities. Since the free tier of their API restricts the use to 500 calls a day and maximum of 10 calls per minute, the script I made to generate the dataset has to wait some seconds between each API call. To limit the API usage and be able to rerun the script I cache weather observations on disk, because I don’t expect past weather observations to change.

A weather observation returned by their API has the following syntax:

{'conds': 'Light Rain Showers',
 'date': {'hour': '14',
 'mday': '29',
 'min': '00',
 'mon': '05',
 'pretty': '2:00 PM CEST on May 29, 2015',
 'tzname': 'Europe/Oslo',
 'year': '2015'},
 'dewpti': '35',
 'dewptm': '2',
 'fog': '0',
 'hail': '0',
 'heatindexi': '-9999',
 'heatindexm': '-9999',
 'hum': '36',
 'icon': 'rain',
 'metar': 'AAXX 29121 01384 11786 52108 10121 20015 51007 60001 78082 84260',
 'precipi': '',
 'precipm': '',
 'pressurei': '',
 'pressurem': '',
 'rain': '1',
 'snow': '0',
 'tempi': '54',
 'tempm': '12',
 'thunder': '0',
 'tornado': '0',
 'utcdate': {'hour': '12',
 'mday': '29',
 'min': '00',
 'mon': '05',
 'pretty': '12:00 PM GMT on May 29, 2015',
 'tzname': 'UTC',
 'year': '2015'},
 'visi': '37',
 'vism': '60',
 'wdird': '210',
 'wdire': 'SSW',
 'wgusti': '',
 'wgustm': '',
 'windchilli': '-999',
 'windchillm': '-999',
 'wspdi': '17.9',
 'wspdm': '28.8'}

As you can see, this is the observation for May 29th at 12:00 UTC and the temperature was 12 degrees Celsius (“tempm”, where m stands for the metric system). The API returns a list of observations for a given place and date. The list contains observation data for at least every hour, some places even three times per hour. For each date and time, I combine the observations from all five places into one long record. If there isn’t data from all five places for that timestamp, I skip it, to make sure that all machine learning records have a sufficient amount of data. Lastly, for all records belonging to the training set, I append the target value (the next day’s temperature) as the last field. That way, I get a file “dataset.csv” with known target values that can be used to train the model. Here is an example record with the last number (8) being the target to predict:


This way, I get a dataset where each timestamp is a separate record and all timestamps belonging to one date gets the next day’s temperature in Oslo at 12 UTC as the target value. In total 844 records for 14 days of observations.

In addition, the script outputs a file “testset.csv” with the last day of observations where the target value is unknown and should be predicted by the model.

Upload both CSV files to Amazon S3 before continuing, as Amazon Machine Learning is only able to use input data from S3 (or Redshift, but in that case Amazon exports it to S3 before using it). Be sure to select the “US Standard” region of S3, as the Machine Learning service is only available in their North Virginia location at the moment. To reduce costs it is important that the S3 datasets are in the same region as the Machine Learning service.

Create Amazon Machine Learning datasources

The datasource is an object used by the Machine Learning model to access the data in S3. Datasource objects contain a schema that tells the model what type of field each variable is (numeric, binary, categorical, text). This schema is autodetected when you create a datasource, but may need some review before continuing to make sure all field types were classified correctly.

To create a datasource, first log into AWS to get access to Amazon Machine Learning. Select Create New -> Datasource. You are then asked for the path to the dataset in S3 and to give it a name, for instance “Weather observations”:

Create datasource

The dataset is verified and the schema is auto-generated. The next step is to make any adjustments to the schema if needed. Remember to say Yes to “Does the first line in your CSV contain the column names?”. I found that most fields containing actual data were correctly categorized, but fields with little or no data were not. There are some observation fields, like for instance “tornado”, that are normally zero for all five places and all times in my dataset. That field is binary but often autodetected as numeric (probably not an issue, since the field has no relevant data). The field “precipm” is numeric, but as it’s often blank (no precipitation detected) it can be mislabeled as categorical. Remember to go through all variables to check for misdetections like these, in my dataset there is 97 variables to check.

The third step is to define a target, which is the variable “target_o_tempm” in my dataset. When selected, Amazon Machine Learning informs you that “ML models trained from this datasource will use Numerical regression.” The last step in datasource creation is to define what field will be used as row identifier, in this case “datetime_utc”. The row identifier will be used to label output from the machine learning model, so it’s handy to use a value that’s unique to each record. The field selected as row identifier will be classified as Categorical. Review the settings and click Finish. It may take some minutes for the datasource to get status Completed, since Amazon does quite a bit of data analysis in the background. For each variable, the range of values is detected and scores like mean and median are computed.

At this point, I suggest you go ahead and create another datasource for the testset while you wait. The testset needs to have the same schema as the dataset used to train the model, which means that every field must have the same classification in both datasources. Therefore it’s smart to create the testset datasource right away when you still remember how each variable should be classified.

In my case I got 30 binary attributes (variables), 11 categorical attributes and 56 numeric attributes. The review page for datasource creation lists that information. Make sure the numbers match for the dataset and testset:

Create datasource - review page

Creating the Machine Learning model

So, now you’ve got two datasources and can initialize model training. Most of the hard work in creating a dataset and the necessary datasources is already done at this point, so it’s time to put all this data to work and initialize creation of the model that is going to generate predictions.

Select Create New -> ML Model on the Machine Learning Dashboard page. The first step is to select the datasource you created for the dataset (use the “I already created a datasource pointing to my S3 data” option). The second step is the model settings, where the defaults are just fine for our use. Amazon splits the training dataset into two parts (a 70-30 split) to be able to both train the model and evaluate its performance. Evaluation is done using the last 30% of the dataset. The last step is to review the settings, and again, the defaults are fine. The advanced settings include options like how many passes ML should do over the dataset (default 10) and how it should do regularization. More on that below, just click Finish now to create the model. This will again take some time, as Amazon performs a lot of computations behind the scenes to train the model and evaluate its performance.

Regularization is a technique used to avoid overfitting of the model to the training dataset. Machine learning models are prone to both underfitting and overfitting problems. Underfitting means the model has failed at capturing the relation between input variables and target variable, so it is poor at predicting the target value. Overfitting also gives poor predictions and is a state where the model follows the training dataset too closely. The model remembers the training data instead of capturing the generic relations between variables. If for instance the target value in the training data fluctuates back and forth, the model output also fluctuates in the same manner. The result is errors and noise in the model output. When used on real data where the target value is unknown, the model will not be able to predict the value in a consistent way. This image helps explain the issue in a beautiful way:

Machine learning - overfitting
Underfitting vs overfitting a machine learning model. Image source:

So to avoid overfitting, regularization is used. Regularization can be performed in multiple ways. Common techniques involve penalizing extreme variable weights and setting small weights to zero, to make the model less dependent on the training data and better at predicting unknown target values.

Exploring model performance

Remember that Amazon did an automatic split of the training data, using 70% of the records to train the model and the remaining 30% to evaluate the model. The Machine Learning service computes a couple of interesting performance metrics to inspect. Go to the Dashboard page and select the evaluation in the list of objects. The first metric shown is the RMSE, Root Mean Square Error, of the evaluation training data. The RMSE should be as low as possible, meaning the mean error is small and the predicted output is close to the actual target value. Amazon computes a baseline RMSE based from the model training data and compares the RMSE of the evaluation training data to that. In my testing I archieved an RMSE of about 2.5 in my first tests and near 2.0 after refining the dataset a bit. The biggest optimization I did was to change the value of invalid weather observations from the default value of -999 or -9999 to be empty. That way the range of values for each field got more close to the truth and did not include those very low numbers.

By selecting “Explore model performance”, you get access to an histogram showing the residuals of the model. Residuals are differences between predicted target and actual value. Here’s the plot for my model:

ML model performance histogram
Distribution of residuals

The histogram tells us that my model has a tendency to over-predict the temperature, and that a residual of 1 to 2 degrees Celcius is the most likely outcome. This is called a positive bias. To lower the bias it is possible to re-train the ML model with more data.

Before we use the model to predict temperatures, I’d like to show some of the interesting parts of the results of model training. Click the datasource for the training dataset on the Dashboard page to load the Data Report. Select Categorical under Attributes in the left menu. Sort the table by “Correlations to target” by clicking on that column. You’ll get a view that looks like this:

Datasource - Data Report Categorical

This table tells you how much weight each field has in determining the target value, so it is a good source of information on what the most important weather observation data are. Of the categorical attributes, the wind direction in Stavanger is the most important attribute for how the temperature is going to be in Oslo the next day. That makes sense since Stavanger is west of Oslo, so weather that hits Stavanger first is likely to arrive to Oslo later. Wind direction in Kristiansand is also important and in third place on this ranking we find the conditions in Trondheim. The most commonly observed values is shown along with a small view of the distribution for each variable. Click Numeric to see a similar ranking for those variables, revealing that dew point temperature for Trondheim and atmospheric pressure in Stavanger are the two most important numeric variables. For numeric variables, the range and mean of observed values is shown. It’s interesting to see that none of the mentioned important variables are weather observations from Oslo.

Use the model to predict the weather

The last step is to actually use the Amazon Machine Learning service to predict tomorrow’s temperature, by feeding the testset to the model. The testset contains one day of observations and an empty target value field.

Go back to the Dashboard and select the model in the object list. Click “Generate batch predictions” at the bottom of the model info page. The first step is to locate the testset datasource, which you’ve already created. When you click Verify Amazon checks that the schema of the testset matches the training dataset, which it should given that all variables have the same classification in both datasources. It is possible to create the testset datasource in this step instead of choosing an existing datasource. However, when I tried that, the schema of the testset was auto-generated with no option for customizing field classifications, so therefore it failed verification since the schema didn’t match the training dataset. That’s why I recommended you create a datasource for the testset too, not just for the training dataset.

After selecting the testset datasource, the last step in the wizard before review is choice of output S3 location. Amazon Machine Learning will create a subfolder in the location you supply. Review and Finish to launch the prediction process. Just like some of the previous steps, this may take some minutes to finish.

The results from the prediction process is saved to S3 as a gzip file. To see the results you need to locate the file in the S3 Console, download it and unpack it. The unpacked file is a CSV file, but lacks the “.csv” suffix, so you might need to add that to get the OS to recognize it properly. Results look like this:


The “score” field is the predicted value, in this case the predicted temperature in Celsius. The tag reveals what observations resulted in the prediction, so for instance the observations from the five places for timestamp 02:00 resulted in a predicted temperature in Oslo the next day at 12:00 UTC of 12.7 degrees Celsius.

As you might have noticed, we’ve now got a bunch of predictions for the same temperature, not just one prediction. The strength in that is that we can inspect the distribution of predicted values and even how the predicted value changes according to observation timestamp (for example to check if weather observations from daytime give better predictions than those from the night).

Distribution of predicted temperature values

Predicted Temperature Distribution

Even though the individual predictions differ, there seems to be strong indications that the value is between 13.0 and 14.5 degrees Celsius. Computing the mean over all predicted temperatures gives 13.6 degrees Celsius.

Development of predicted temperature values

Predicted Temperature Development

This plot shows the development in the predicted value as the observation time progresses. There does not seem to be any significant trend in how the different observation times perform.

The actual temperature – and some closing remarks

At this point I’m sure you wonder what the real temperature value ended up being. The value I tried to predict using Amazon Machine Learning turned out to be:

12:00 UTC on May 31, 2015:  12 degrees Celsius

Taking into account the positive bias of 1 to 2 degrees and a prediction mean value of 13.6 degrees Celsius, I am very satisfied with these results.

To improve the model performance further I could try to reduce the bias. To do that I’d have to re-train the model with more training data, since two weeks of data isn’t much. To get a model which could be used all year around, I’d have to include training data from a relevant subset of days throughout the year (cold days, snowy days, heavy rain, hot summer days and so on). Another possible optimization is to include more places in the dataset. The dataset lacks places to the east of Oslo, which is an obvious flaw. In addition to more data, I could explore if the dataset might need to be organized differently. It is for instance possible to append all observations from an entire day into one record, instead of creating a separate record for each observation timestamp. That would give the model more variables to use for prediction but then only one record to predict the value from.

It has been very interesting to get started with Amazon Machine Learning and test it with a real-life dataset. The power of the tool combined with the ease of use that Amazon has built into the service makes it a great offering. I’m sure I’ll use the service for new experiments in the future!

~ Arne ~

Making the Network Ready for 40GbE to the Server

In today’s server networks, 10GbE has become commonplace and has taken over for multiple 1GbE links to each server. However, for some workloads, 10GbE might not be enough. One such case is OpenStack network nodes which potentially handle a big part of the traffic in and out of an OpenStack cloud (depending on how it’s configured). When faced with such use cases, how should the network prepare for delivering 40GbE* to servers?

The issue here is the number of available 40GbE ports on data center switches and the cost of cabling. The most cost-effective cabling for both 10GbE and 40GbE is the Direct Attached Cable (DAC) type based on Twinaxial cabling. Such cables are based on copper and have transceivers directly connected to each end of the cable. For 10GbE the SFP+ standard is commonplace in server NICs and switches. 40GbE uses the slightly larger QSFP transceivers, which internally are made up of four 10Gbit/s lanes (an important feature which we’ll come back to). DAC cables exist in lengths up 10 metres (33 feet), but the price increases substantially when the cables get longer than 3 to 5 metres. When longer runs of 10GbE or 40GbE than 10 metres are needed, fiber cabling and separate transceivers are the only option. The cost of each transceiver is usually several times that of one DAC cable. Constraints like that are important to take into account when designing a data center network.

In an end-of-row design with chassis-based switches it’s relatively easy to provide the needed 40GbE ports by adding line cards with the right port type, but the cabling cost will be an issue. Servers in adjacent racks may use DAC cabling, but as the distance to each server NIC increases so does the cost of each DAC cable. In addition, fiber connections have to be used for everything longer than 10 metres, which is further adding to the cabling costs.

On the contrary, with a top-of-rack switch design where all the cabling is inside each rack, the cable lengths are limited to 3 metres at the most. That makes DAC cables a perfect fit. However, the issue then becomes how to make 40GbE ports available to servers from top-of-rack switches. Typical 10GbE datacenter switches are equipped with some 40GbE ports for uplinks to the rest of the network (commonly a spine layer in bigger designs), so it may be tempting to use some of those for the few servers requiring 40GbE in the beginning. Doing so is not advisable, since it decreases the available uplink bandwidth and limits the scalability of the design. With a leaf-spine design with four spine switches the requirement is four uplinks from each leaf. When bandwidth usage increases, you may want to increase that by either increasing the number of spines to eight or double the amount of bandwidth from each leaf to each spine. Both of these scalability options require eight 40GbE ports for spine uplinks, so if you’ve used some of those 40GbE ports to connect servers, you don’t have the necessary ports to scale.

If 10GbE switches are not an option in a mixed 10/40GbE environment where only a handful of servers need 40GbE, then what about pure 40GbE top-of-rack switches? Several vendors supply 1 RU 40GbE data center switches with 24 to 32 QSFP ports. With a majority of servers still using 10GbE, this might not seem like a viable option. Enter the DAC Breakout Cable. Using a QSFP-to-SFP+ breakout cable, each 40GbE port can be split into four individual 10GbE ports and used to connect several servers to the switch. What makes this possible is the mentioned 4 x 10Gbit/s lanes that 40GbE QSFP is made up of internally.

Using breakout cables for all server cabling in a rack enables a smooth migration path to 40GbE, since the servers that need 40GbE can get their own QSFP port on the switch. In a leaf-spine design with four spines, even a 24-port 40GbE top-of-rack switch is a viable leaf option. Using 4 ports for spine uplinks there is 20 ports available for servers, which equals a total of 80 10GbE connections using breakout cables. That’s enough for 40 servers with dual 10GbE. When a few servers migrate to 40GbE NICs, you re-cable them using QSFP DAC cables to the top-of-rack switch. If you need to scale out to eight spines, you take four more 40GbE ports for spine uplinks and adjust down the number of possible server 10GbE links accordingly. This design provides flexibility both in terms of spine uplink count and gradual 10-to-40GbE migration for servers. An then, in the near future when a majority of servers use 40GbE, you can add another leaf switch to increase the available port count (or replace the one in the rack with a model with higher port count).

~ Arne ~

* I’m aware of the 25/50 GbE initiative and it will be very interesting to see which of the 40GbE and 25 GbE technologies that will prevail in the future. 25 GbE might pan out as the most cost-efficient alternative. However, the 40GbE technology is already used in today’s networks as the preferred spine uplink technology, which makes it relatively easy to create a migration path to 40GbE for servers too.

Getting Started with SocketPlane and Docker on OpenStack VMs

With all the well-deserved attention Docker gets these days, the networking aspects of Docker become increasingly important. As many have pointed out already, Docker itself has somewhat limited networking options. Several projects exist to fix this; SocketPlane, Weave, Flannel by CoreOS and Kubernetes (which is an entire container orchestration solution). Docker recently acquired SocketPlane to become part of Docker itself, both to gain better native networking options and to get help building the networking APIs necessary for other network solutions to plug into Docker.

In this post, I’ll show how to deploy and use SocketPlane on OpenStack VMs. This is based on the technology preview of SocketPlane available on Github, which I’ll deploy on Ubuntu 14.04 Trusty VMs.

Launch the first VM and bootstrap the cluster

As SocketPlane is a cluster solution with automatic leader election, all nodes in the cluster are equal and run the same services. However, the first node has to be told to bootstrap the cluster. With at least one node running, new nodes automatically join the cluster when they start up.

To get the first node to download SocketPlane, install the software including all dependencies, and bootstrap the cluster, create a Cloud-init script like this:

cat > <<EOF
curl -sSL | sudo BOOTSTRAP=true sh
sudo socketplane cluster bind eth0

Start the first node and wait for the SocketPlane bootstrap to complete before starting more nodes (it takes a while, so grab a cup of coffee):

$ nova boot --flavor m1.medium --image "Ubuntu CI trusty 2014-09-22" --key-name arnes --user-data --nic net-id=df7cc182-8794-4134-b700-1fb8f1fbf070 socketplane1
$ nova floating-ip-associate socketplane1

You have to customize the flavor, image, key-name, net-id and floating IP to suit your OpenStack environment before running these commands. I attach a floating IP to the node to be able to log into it and interact with SocketPlane. If you want to watch the progress of Cloud-init, you can now tail the output logs via SSH like this:

$ ssh ubuntu@ "tail -f /var/log/cloud-init*"
Warning: Permanently added '' (ECDSA) to the list of known hosts.
==> /var/log/cloud-init.log <==
Mar 4 18:20:16 socketplane1 [CLOUDINIT][DEBUG]: Writing to /var/lib/cloud/instances/4e158f82-c5d8-4629-b7dc-2c1fbbe5f9f2/sem/config_scripts_vendor - wb: [420] 20 bytes
Mar 4 18:20:16 socketplane1 [CLOUDINIT][DEBUG]: Running config-scripts-vendor using lock (<FileLock using file '/var/lib/cloud/instances/4e158f82-c5d8-4629-b7dc-2c1fbbe5f9f2/sem/config_scripts_vendor'>)
Mar 4 18:20:16 socketplane1 [CLOUDINIT][DEBUG]: Running command ['/var/lib/cloud/instance/scripts/part-001'] with allowed return codes [0] (shell=False, capture=False)

==> /var/log/cloud-init-output.log <==
511136ea3c5a: Pulling fs layer
511136ea3c5a: Download complete
8771fbfe935c: Pulling metadata
8771fbfe935c: Pulling fs layer
8771fbfe935c: Download complete
0e30e84e9513: Pulling metadata

As you can see from the output above, the SocketPlane setup script is busy fetching the Docker images for the dependencies of SocketPlane and the SocketPlane agent itself. When the bootstrapping is done, the output will look like this:

7c5e9d5231cf: Download complete
7c5e9d5231cf: Download complete
Status: Downloaded newer image for clusterhq/powerstrip:v0.0.1
Requesting SocketPlane to listen on eth0
Cloud-init v. 0.7.5 finished at Wed, 04 Mar 2015 18:25:54 +0000. Datasource DataSourceOpenStack [net,ver=2]. Up 348.19 seconds

The “Done!!!” line marks the end of the setup script downloaded from The next line of output is from the “sudo socketplane cluster bind eth0” command I included in the Cloud-init script.

Important note about SocketPlane on OpenStack VMs

If you just follow the deployment instructions for a Non-Vagrant install / deploy in the SocketPlane README, you might run into an issue with the SocketPlane agent. The agent by default tries to autodetect the network interface to bind to, but that does not seem to work as expected when using OpenStack VMs. If you encounter this issue, the agent log will be full of messages like these:

$ sudo socketplane agent logs
INFO[0007] Identifying interface to bind ... Use --iface option for static binding
INFO[0015] Identifying interface to bind ... Use --iface option for static binding
INFO[0023] Identifying interface to bind ... Use --iface option for static binding
INFO[0031] Identifying interface to bind ... Use --iface option for static binding
INFO[0039] Identifying interface to bind ... Use --iface option for static binding

To resolve this issue you have to explicitly tell SocketPlane which network interface to use:

sudo socketplane cluster bind eth0

If you don’t, the SocketPlane setup process will be stuck and never complete. This step is required on all nodes in the cluster, since they follow the same setup process.

Check the SocketPlane agent logs

The “socketplane agent logs” CLI command is useful for checking the cluster state and to see what events have occured. After the initial setup process has finished, the output will look similar to this:

$ sudo socketplane agent logs
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
2015/03/04 18:25:54 Watch (type: nodes) errored: Get dial tcp connection refused, retry in 5s
==> Starting Consul agent RPC...
==> Consul agent running!
 Node name: 'socketplane1'
 Datacenter: 'dc1'
 Server: true (bootstrap: true)
 Client Addr: (HTTP: 8500, HTTPS: -1, DNS: 8600, RPC: 8400)
 Cluster Addr: (LAN: 8301, WAN: 8302)
 Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

 2015/03/04 18:25:54 [INFO] serf: EventMemberJoin: socketplane1
 2015/03/04 18:25:54 [INFO] serf: EventMemberJoin: socketplane1.dc1
 2015/03/04 18:25:54 [INFO] raft: Node at [Follower] entering Follower state
 2015/03/04 18:25:54 [INFO] consul: adding server socketplane1 (Addr: (DC: dc1)
 2015/03/04 18:25:54 [INFO] consul: adding server socketplane1.dc1 (Addr: (DC: dc1)
 2015/03/04 18:25:54 [ERR] agent: failed to sync remote state: No cluster leader
INFO[0111] Identifying interface to bind ... Use --iface option for static binding
INFO[0111] Binding to eth0
2015/03/04 18:25:55 watchForExistingRegisteredUpdates : 0
2015/03/04 18:25:55 key :
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==> Error starting agent: Failed to start Consul server: Failed to start RPC layer: listen tcp bind: address already in use
 2015/03/04 18:25:55 [ERR] http: Request /v1/catalog/nodes, error: No cluster leader
2015/03/04 18:25:55 Watch (type: nodes) errored: Unexpected response code: 500 (No cluster leader), retry in 5s
 2015/03/04 18:25:55 [WARN] raft: Heartbeat timeout reached, starting election
 2015/03/04 18:25:55 [INFO] raft: Node at [Candidate] entering Candidate state
 2015/03/04 18:25:55 [INFO] raft: Election won. Tally: 1
 2015/03/04 18:25:55 [INFO] raft: Node at [Leader] entering Leader state
 2015/03/04 18:25:55 [INFO] consul: cluster leadership acquired
 2015/03/04 18:25:55 [INFO] consul: New leader elected: socketplane1
 2015/03/04 18:25:55 [INFO] raft: Disabling EnableSingleNode (bootstrap)
 2015/03/04 18:25:55 [INFO] consul: member 'socketplane1' joined, marking health alive
 2015/03/04 18:25:56 [INFO] agent: Synced service 'consul'
INFO[0114] New Node joined the cluster :
2015/03/04 18:25:59 Status of Get 404 Not Found 404 for http://localhost:8500/v1/kv/ipam/
2015/03/04 18:25:59 Updating KV pair for http://localhost:8500/v1/kv/ipam/ 0
2015/03/04 18:25:59 Status of Get 404 Not Found 404 for http://localhost:8500/v1/kv/network/default
2015/03/04 18:25:59 Updating KV pair for http://localhost:8500/v1/kv/network/default?cas=0 default {"id":"default","subnet":"","gateway":"","vlan":1} 0
2015/03/04 18:25:59 Status of Get 404 Not Found 404 for http://localhost:8500/v1/kv/vlan/vlan
2015/03/04 18:25:59 Updating KV pair for http://localhost:8500/v1/kv/vlan/vlan?cas=0 vlan 0

SocketPlane uses Consul as a distributed key-value store for cluster configuration and cluster membership tracking. From the log output we can see that a Consul agent is started, the “socketplane1” host joins, a leader election is performed (which this single Consul agent obviously wins), and key-value pairs for the default subnet and network are created.

A note on the SocketPlane overlay network model

The real power of the SocketPlane solution lies in the overlay networks it creates. The overlay network spans all SocketPlane nodes in the cluster. SocketPlane uses VXLAN tunnels to encapsulate container traffic between nodes, so that several Docker containers running on different nodes can belong to the same virtual network and get IP addresses in the same subnet. This resembles the way OpenStack itself can use VXLAN to encapsulate traffic for a virtual tenant network that spans several physical compute hosts in the same cluster. Using SocketPlane on an OpenStack cluster which uses VXLAN (or GRE) means we use two layers of encapsulation, which is something to keep in mind if MTU and fragmentation issues occur.

Spin up more SocketPlane worker nodes

Of course we need some more nodes as workers in our SocketPlane cluster to make it a real cluster, so create another Cloud-init script for them to use:

cat > <<EOF
curl -sSL | sudo sh
sudo socketplane cluster bind eth0

This is almost identical to the first Cloud-init script, just without the BOOTSTRAP=true environment variable.

Spin up a couple more nodes:

$ nova boot --flavor m1.medium --image "Ubuntu CI trusty 2014-09-22" --key-name arnes --user-data --nic net-id=df7cc182-8794-4134-b700-1fb8f1fbf070 socketplane2
$ nova boot --flavor m1.medium --image "Ubuntu CI trusty 2014-09-22" --key-name arnes --user-data --nic net-id=df7cc182-8794-4134-b700-1fb8f1fbf070 socketplane3

Watch the agent log from the first node in realtime with the -f flag (just like with “tail”) to validate that the nodes join the cluster as they are supposed to:

$ sudo socketplane agent logs -f
2015/03/04 19:10:42 New Bonjour Member : socketplane2, _docker._cluster, local,
INFO[6398] New Member Added :
 2015/03/04 19:10:42 [INFO] agent.rpc: Accepted client:
 2015/03/04 19:10:42 [INFO] agent: (LAN) joining: []
 2015/03/04 19:10:42 [INFO] serf: EventMemberJoin: socketplane2
 2015/03/04 19:10:42 [INFO] agent: (LAN) joined: 1 Err: <nil>
 2015/03/04 19:10:42 [INFO] consul: member 'socketplane2' joined, marking health alive
Successfully joined cluster by contacting 1 nodes.
INFO[6398] New Node joined the cluster :

2015/03/04 19:10:54 New Bonjour Member : socketplane3, _docker._cluster, local,
INFO[6409] New Member Added :
 2015/03/04 19:10:54 [INFO] agent.rpc: Accepted client:
 2015/03/04 19:10:54 [INFO] agent: (LAN) joining: []
 2015/03/04 19:10:54 [INFO] serf: EventMemberJoin: socketplane3
 2015/03/04 19:10:54 [INFO] agent: (LAN) joined: 1 Err: <nil>
 2015/03/04 19:10:54 [INFO] consul: member 'socketplane3' joined, marking health alive
Successfully joined cluster by contacting 1 nodes.
INFO[6409] New Node joined the cluster :

The nodes joined the cluster as expected, with no need to actually SSH into the VMs and run any CLI commands, since Cloud-init took care of the entire setup process. As you may have noted I didn’t allocate any floating IP to the new worker VMs, since I don’t need access to them directly. All the VMs run in the same OpenStack virtual tenant network and are able to communicate internally on that subnet ( in my case).

Create a virtual network and launch the first container

To test the new SocketPlane cluster, first create a new virtual network “net1” with an address range you choose yourself:

$ sudo socketplane network create net1
 "gateway": "",
 "id": "net1",
 "subnet": "",
 "vlan": 2

Now you should have two SocketPlane networks, the default and the new one you just created:

$ sudo socketplane network list
 "gateway": "",
 "id": "default",
 "subnet": "",
 "vlan": 1
 "gateway": "",
 "id": "net1",
 "subnet": "",
 "vlan": 2

Now, launch a container on the virtual “net1” network:

$ sudo socketplane run -n net1 -it ubuntu /bin/bash
Unable to find image 'ubuntu:latest' locally
fa4fd76b09ce: Pulling fs layer
1c8294cc5160: Pulling fs layer
2d24f826cb16: Download complete
Status: Downloaded newer image for ubuntu:latest

The “-n net1” option tells SocketPlane what virtual network to use. The container is automatically assigned a free IP address from the IP address range you chose. I started an Ubuntu container running Bash as an example. You can start any Docker image you want, as all arguments after “-n net1” are passed directly to the “docker run” command which SocketPlane wraps.

The beauty of SocketPlane is that you don’t have to do any port mapping or linking for containers to be able to communicate with other containers. They behave just like VMs launched on a virtual OpenStack network and have access to other containers on the same network, in addition to resources outside the cluster:

root@4e06413f421c:/# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
8: ovsaa22ac2: <BROADCAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default
 link/ether 02:42:0a:64:00:02 brd ff:ff:ff:ff:ff:ff
 inet scope global ovsaa22ac2
 valid_lft forever preferred_lft forever
 inet6 fe80::ace8:d3ff:fe4a:ecfc/64 scope link
 valid_lft forever preferred_lft forever

root@4e06413f421c:/# ping -c 1
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=0.043 ms

root@4e06413f421c:/# ping -c 1
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=51 time=29.7 ms

Multiple containers on the same virtual network

Keep the previous window open to keep the container running and SSH to the first SocketPlane node again. Then launch another container on the same virtual network and ping the first container to verify connectivity:

$ sudo socketplane run -n net1 -it ubuntu /bin/bash
$ root@7c30071dbab4:/# ip addr | grep 10.100
 inet scope global ovs658b61c
$ root@7c30071dbab4:/# ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=0.307 ms
64 bytes from icmp_seq=2 ttl=64 time=0.057 ms

As expected, both containers see each other on the subnet they share and can communicate. However, both containers run on the first SocketPlane node in the cluster. To prove that this communication works also between different SocketPlane nodes, I’ll SSH from the first to the second node and start a new container there. To SSH between nodes I’ll use the private IP address of the second SocketPlane VM, since I didn’t allocate a floating IP to it:

ubuntu@socketplane1:~$ ssh
ubuntu@socketplane2:~$ sudo socketplane run -n net1 -it ubuntu /bin/bash
Unable to find image 'ubuntu:latest' locally
fa4fd76b09ce: Pulling fs layer
2d24f826cb16: Download complete
Status: Downloaded newer image for ubuntu:latest

root@bfde7387e160:/# ip addr | grep 10.100
 inet scope global ovs06e4b44

root@bfde7387e160:/# ping -c 1
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=1.53 ms

root@bfde7387e160:/# ping -c 1
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=1.47 ms

root@bfde7387e160:/# ping -c 1
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=51 time=30.2 ms

No trouble there, the new container on a different OpenStack VM can reach the other containers and also communicate with the outside world.

This concludes the Getting Started-tutorial for SocketPlane on OpenStack VMs. Please bear in mind that this is based on a technology preview of SocketPlane, which is bound to change as SocketPlane and Docker become more integrated in the months to come. I’m sure the addition of SocketPlane to Docker will bring great benefits to the Docker community as a whole!

~ Arne ~

How to configure Knife and Test Kitchen to use OpenStack

When developing Chef cookbooks, Knife and Test Kitchen (hereafter just “Kitchen”) are essential tools in the workflow. Both tools can be set up to use OpenStack to make it easy to create VMs for testing regardless of the capabilities of the workstation used. It’s great for testing some new recipe in a cookbook or making sure changes do not break existing cookbook functionality. This post will go through the configuration of both tools to ensure they use OpenStack instead of the default Vagrant drivers.

Install software and dependencies

First, it is necessary to install the software, plugins and dependencies. Let’s start with some basic packages:

sudo apt-get install ruby1.9 git
sudo apt-get install make autoconf gcc g++ zlib1g-dev bundler

Chef Development Kit

The Chef Development Kit is a collection of very useful tools for any cookbook developer. It includes tools like Knife, Kitchen, Berkshelf, Foodcritic, and more. Fetch download links for the current release from the Chef-DK download page and install it, for example like this for Ubuntu:

sudo dpkg -i chefdk_0.4.0-1_amd64.deb

Kitchen OpenStack driver

By default, Kitchen uses Vagrant as the driver to create virtual machines for running tests. To get OpenStack support, install the Kitchen OpenStack driver. The recommended way of installing it is to add the Ruby gem to the Gemfile in your cookbook and use Bundler to install it:

echo 'gem "kitchen-openstack"' >> Gemfile
sudo bundle

Knife OpenStack plugin

With the OpenStack plugin Knife is able to create new OpenStack VMs and bootstrap them as nodes on your Chef server. It can also list VMs and delete VMs. Install the plugin with:

gem install knife-openstack

OpenStack command line clients

The command line clients for OpenStack are very useful for checking values like image IDs, Neutron networks and so on. In addition, they offer one-line access to actions like creating new VMs, allocating new floating IPs and more. Install the clients with:

sudo apt-get install python-novaclient python-neutronclient python-glanceclient

Configure Knife to use OpenStack

After installing the plugin to get OpenStack support for Knife, you need to append some lines to the Knife config file “~/.chef/knife.rb”:

cat >> ~/.chef/knife.rb <<EOF
# Knife OpenStack plugin setup
knife[:openstack_auth_url] = "#{ENV['OS_AUTH_URL']}/tokens"
knife[:openstack_username] = "#{ENV['OS_USERNAME']}"
knife[:openstack_password] = "#{ENV['OS_PASSWORD']}"
knife[:openstack_tenant] = "#{ENV['OS_TENANT_NAME']}"

What these lines do is to instruct Knife to use the contents of environment variables to authenticate with OpenStack when needed. The environment variables are the ones you get when you source the OpenStack RC file of your project. The RC file can be downloaded from the OpenStack web UI (Horizon) by navigating to Access & Security -> API Access -> Download OpenStack RC file. Sourcing the file makes sure the environment variables are part of the current shell environment, and is done like this (for an RC file called “”):

$ .

With this config in place Knife now has the power to create new OpenStack VMs in your project, list all active VMs and destroy VMs. In addition, it can be used to list available images, flavors and networks in OpenStack. I do however prefer to use the native OpenStack clients (glance, nova, neutron) for that, since they can perform lots of other valuable tasks like creating new networks and so on.

Below is an example of VM creation with Knife, using some of the required and optional arguments to the command. Issue “knife openstack server create –help” to get all available arguments. As a quick summary, the arguments I give Knife are the requested hostname of the server, the flavor (3 = m1.medium in my cluster), image ID of a CentOS 7 image, network ID, SSH key name and the default user account used by the image (“centos”).

With the “–openstack-floating-ip” argument I tell Knife to allocate a floating IP to the new server. I could have specified a specific floating IP after that argument, which would have been allocated to the new server whether it was in use before or not. The only requirement is that it must be allocated to my OpenStack project before I try to use it.

$ knife openstack server create -N test-server -f 3 -I b206baa3-3a80-41cf-9850-49021b8bb3c1 --network-ids df7cc182-8794-4134-b700-1fb8f1fbf070 --openstack-ssh-key-id arnes --ssh-user centos --openstack-floating-ip --no-host-key-verify

Waiting for server [wait time = 600].........................
Instance ID 13493d82-8dc2-4b1d-87e8-3eeefa8defe2
Name test-server
Flavor 3
Image b206baa3-3a80-41cf-9850-49021b8bb3c1
Keypair arnes
Availability Zone nova
Floating IP Address:
Bootstrapping the server by using bootstrap_protocol: ssh and image_os_type: linux

Waiting for sshd to host (
Connecting to Installing Chef Client... Downloading Chef 11 for el... Installing Chef 11 Thank you for installing Chef! Starting first Chef Client run...
... Running handlers complete Chef Client finished, 0/0 resources updated in 1.328282722 seconds
Instance ID 13493d82-8dc2-4b1d-87e8-3eeefa8defe2
Name test-server
Public IP
Flavor 3
Image b206baa3-3a80-41cf-9850-49021b8bb3c1
Keypair arnes
Availability Zone nova

As an added benefit of creating VMs this way, they are automatically bootstrapped as Chef nodes with your Chef server!

Configure Kitchen to use OpenStack

Kitchen has a config file “~/.kitchen/config.yml” where all the config required to use OpenStack should be placed. The config file is “global”, meaning it’s not part of any cookbook or Chef repository. The advantage of using the global config file is that the Kitchen config in each cookbook is reduced to just one line, which is good since that Kitchen config is commonly committed to the cookbook repository and shared with other developers. Other developers may not have access to the same OpenStack environment as you, so their Kitchen OpenStack config will differ from yours.

Run the following commands to initialize the necessary config for Kitchen:

mkdir ~/.kitchen
cat >> ~/.kitchen/config.yml <<EOF
 name: openstack
 openstack_username: <%= ENV['OS_USERNAME'] %>
 openstack_api_key: <%= ENV['OS_PASSWORD'] %>
 openstack_auth_url: <%= "#{ENV['OS_AUTH_URL']}/tokens" %>
 openstack_tenant: <%= ENV['OS_TENANT_NAME'] %>
 require_chef_omnibus: true
 image_ref: CentOS 7 GC 2014-09-16
 username: centos
 flavor_ref: m1.medium
 key_name: <%= ENV['OS_USERNAME'] %>
 floating_ip_pool: public
 - net1
 no_ssh_tcp_check: true
 no_ssh_tcp_check_sleep: 30

There is quite a bit of config going on here, so I’ll go through some of the most important parts. Many of the configuration options rely on environment variables which are set when you source the OpenStack RC file, just like for Knife. In addition, the following options may need to be customized according to your OpenStack environment:

  • image_ref: The name of a valid image to use when creating VMs
  • username: The username used by the chosen image, in this case “centos”
  • flavor_ref: A valid name of a flavor to use when creating VMs
  • key_name: Must match the name of your SSH key in OpenStack, here it is set to equal your username
  • floating_ip_pool: The name of a valid pool of public IP addresses
  • network_ref: A list of existing networks to connect new VMs to

To determine the correct values for image, flavor and network above, use the command line OpenStack clients. The Glance client can output a list of valid images to choose from:

$ glance image-list
| ID                                   | Name     | Disk Format | Container Format | Size | Status |
| ee2cc71b-3e2e-4b11-b327-f9cbf73a5694 | CentOS 6 GC 14-11-12   | raw | bare | 8589934592 | active |
| b206baa3-3a80-41cf-9850-49021b8bb3c1 | CentOS 7 GC 2014-09-16 | raw | bare | 8589934592 | active |

Set the image_ref in the Kitchen config to either the ID, the name or a regex matching the name.

Correspondingly, find the allowed flavors with the Nova client:

$ nova flavor-list
| ID | Name      | Memory_MB | Disk | Ephemeral | Swap | VCPUs | RXTX_Factor | Is_Public |
|  2 | m1.small  |    2048   |  20  |     0     |      |   1   |     1.0     |    True   |
|  3 | m1.medium |    4096   |  40  |     0     |      |   2   |     1.0     |    True   |

The network names are available using the neutron client. However, if you haven’t created any networks yet, you can create a network, subnet and router like this:

neutron net-create net1
neutron subnet-create --name subnet1 net1
neutron router-create gw
neutron router-gateway-set gw public
neutron router-interface-add gw subnet1

These commands assume that the external network in your OpenStack cluster is named “public”. Assuming the commands complete successfully you may use the network name “net1” in the Kitchen config file. To get the list of available networks, use the Neutron client with the net-list subcommand:

$ neutron net-list
| id                                   | name   | subnets |
| 2d2b2336-d7b6-4adc-b7f2-c92f98d4ec58 | public | 5ac43f4f-476f-4513-8f6b-67a758aa56e7 |
| e9dcbda9-cded-4823-a9fe-b03aadf33346 | net1   | 8ba65517-9bf5-46cc-a392-03a0708cd7f3 |

With all that configured, Kitchen is ready to use OpenStack as the driver instead of Vagrant. All you need to do in a cookbook to make Kitchen use the OpenStack driver, is to change the “driver” statement in the “.kitchen.yml” config file in the cookbook root directory from “vagrant” to “openstack”:

 name: openstack

So, lets take it for a spin:

$ kitchen create
-----> Starting Kitchen (v1.2.1)
-----> Creating <default-ubuntu-1404>...
 OpenStack instance <c08688f6-a754-4f43-a365-898a38fc06f8> created.
(server ready)
 Attaching floating IP from <public> pool
 Attaching floating IP <>
 Waiting for
 Waiting for
 Waiting for
 (ssh ready)
 Using OpenStack keypair <arnes>
 Using public SSH key <~/.ssh/>
 Using private SSH key <~/.ssh/id_rsa>
 Adding OpenStack hint for ohai
Finished creating <default-ubuntu-1404> (0m50.68s).
-----> Kitchen is finished. (0m52.22s)

Voilà 🙂

How to Use Cloud-init to Customize New OpenStack VMs

When creating a new instance (VM) on OpenStack with one of the standard Ubuntu Cloud images, the next step is typically to install packages and configure applications. Instead of doing that manually every time, OpenStack enables automatic setup of new instances using Cloud-init. Cloud-init runs on first boot of every new instance and initializes it according to a provided script or config file. The functionality is part of the Ubuntu image and works the same way regardless of the cloud provider used (Amazon, RackSpace, private OpenStack cloud). Cloud-init is also available for other distributions as well.

Creating a customization script

Standard Bash script

Perhaps the easiest way to get started is to create a standard Bash script that Cloud-init runs on first boot. Here is a simple example to get Apache2 up and running:

$ cat > <<EOF
> #!/bin/bash
> apt-get update
> apt-get -y install apache2
> a2ensite 000-default

This small script installs the Apache2 package and enables the default site. Of course, you’d likely need to do more configuration here before enabling the site, like an rsync of web content to document root and enabling TLS.

Launch a new web instance

Use the nova CLI command to launch an instance named “web1” and supply the filename of the customization script with the “–user-data” argument:

$ nova boot --flavor m1.medium --image "Ubuntu CI trusty 2014-09-22" --key-name arnes web1
| Property | Value         |
| name     | web1          |
| flavor   | m1.medium (3) |

To access the instance from outside the cloud, allocate a new floating IP and associate it with the new instance:

$ nova floating-ip-create public
| Ip         | Server Id | Fixed Ip | Pool   |
| |           | -        | public |
$ nova floating-ip-associate web1


The new web instance has Apache running right from the start, no manual steps needed:

Default Apache2 page

More Cloud-init options: Cloud-Config syntax

Cloud-init can do more than just run bash scripts. Using cloud-config syntax many different actions are possible. The documentation has many useful examples of cloud-config syntax to add user accounts, configure mount points, initialize the instance as a Chef/Puppet client and much more.

For example, the same Apache2 initialization as above can be done with the following cloud-config statements:

 - apache2
 - [ a2ensite, "000-default" ]

Including scripts or config files

Including a script or config file from an external source is also possible. This can be useful if the config file is under revision control in Git. Including files is easy, just replace the script contents with an include statement and the URL:


The gist contains the same cloud-config statements as above, so the end result it the same.


Cloud-init logs messages to /var/log/cloud-init.log and in my tests even debug level messages were logged. In addition, Cloud-init records all console output from changes it performs to /var/log/cloud-init-output.log. That makes it easy to catch errors in the initialization scripts, like for instance when I omitted ‘-y’ to apt-get install and package installation failed:

The following NEW packages will be installed:
 apache2 apache2-bin apache2-data libapr1 libaprutil1 libaprutil1-dbd-sqlite3
 libaprutil1-ldap ssl-cert
0 upgraded, 8 newly installed, 0 to remove and 88 not upgraded.
Need to get 1284 kB of archives.
After this operation, 5342 kB of additional disk space will be used.
Do you want to continue? [Y/n] Abort.
/var/lib/cloud/instance/scripts/part-001: line 4: a2ensite: command not found
2015-02-05 09:59:56,943 -[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [127]
2015-02-05 09:59:56,944 -[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2015-02-05 09:59:56,945 -[WARNING]: Running scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Cloud-init v. 0.7.5 finished at Thu, 05 Feb 2015 09:59:56 +0000. Datasource DataSourceOpenStack [net,ver=2]. Up 22.14 seconds

The line “Do you want to continue? [Y/n] Abort.” is a clear indicator that apt-get install failed since it expected user input. Most CLI tools can be run without user input by just passing the correct options, like ‘-y’ to apt-get. After correcting that error, the output is as expected:

The following NEW packages will be installed:
 apache2 apache2-bin apache2-data libapr1 libaprutil1 libaprutil1-dbd-sqlite3
 libaprutil1-ldap ssl-cert
0 upgraded, 8 newly installed, 0 to remove and 88 not upgraded.
Need to get 1284 kB of archives.
After this operation, 5342 kB of additional disk space will be used.
Get:1 trusty/main libapr1 amd64 1.5.0-1 [85.1 kB]
Get:2 trusty/main libaprutil1 amd64 1.5.3-1 [76.4 kB]
Cloud-init v. 0.7.5 running 'modules:final' at Thu, 05 Feb 2015 12:35:49 +0000. Up 38.42 seconds.
Site 000-default already enabled
Cloud-init v. 0.7.5 finished at Thu, 05 Feb 2015 12:35:49 +0000. Datasource DataSourceOpenStack [net,ver=2]. Up 38.56 seconds

This also reveals that the command “a2ensite 000-default” is not needed since the default site is enabled already. However, it’s included here as an example of how to run shell commands using cloud-config statements.

Testing vs Production

Using Cloud-init to get new instances to the desired state is nice when testing and a necessary step when deploying production instances. In a production context, one would probably use Cloud-init to initialize the instance as a Chef or Puppet client. From there, Chef/Puppet takes over the configuration task and will make sure the instance is set up according to the desired role it should fill. Cloud-init makes the initial bootstrapping of the instance easy.

New Skill Sets Needed for Network Specialists

I share a lot of the views presented by @netmanchris in his plan for technology areas to focus on and the follow-up post It Generalist or Network Specialist?. I started out in this field as a Networking Specialist focusing on the traditional areas like switching, routing and firewalls. However, over time the need for automation, scripting and data analysis popped up more and more in my day job, to be able to automate manual tasks and improve the quality of networking services like firewall rulesets.

I’ve written about a use case for data analysis in networking in a previous post. Another example I want to highlight is the need for automatically updating firewall object-groups with new IP addresses as DNS entries for remote services change. Not all firewalls support DNS names as destination in a firewall rule. When a server need access to a specific remote site, and that site changes IP addresses from time to time, the IP adresses in the firewall config need to change too. To facilitate this I implemented a solution using Python which keeps object-groups in the firewall config in sync with IP addresses in DNS replies.

So far, these examples are from the traditional networking field. There is a new kind of Networking in the making with the invention of SDN solutions and the pervasive virtualization of everything, not just servers. Almost everything in this new Networking has APIs to support automation and requires us to learn some programming to be able to implement efficient solutions.

Python has emerged as the language of choice to sysadmins and net admins. To be able to use Python efficiently, basic knowledge of relevant modules is key. For example Paramiko for using SSH in scripts and IPy to handle IP addresses and subnets easily. To interface with APIs there may be specific modules available like python-neutronclient for the OpenStack Neutron API, if not you can always use HTTP via Requests.

In addition to Python I see knowledge of Git, Chef/Puppet, hypervisors, Linux networking, OVS, OpenStack and so on as very important in this new infrastructure-as-code movement. The new networking engineer role will be a hybrid sysadmin/netadmin/devop role and brings with it opportunities to make every day at work even more interesting!

~ Arne ~