R is a programming language created in the early 1990s by Ross Ihaka and Robert Gentleman while they were working at the University of Auckland. R is based on another language, S, which was developed by John Chambers and colleagues at Bell Laboratories in the 1970s. Although it has been around for well over 15 years, R has been receiving a lot of attention recently. The best thing about R is that it is free and open-source software, so you can download it at no cost and start working with it in no time. R is a great tool for statistical computing and data visualization: it supports data manipulation and transformation as well as sophisticated graphical displays. R is taught in college and university courses on statistics and advanced analytics, in some cases replacing more traditional statistical tools. It is also used in specific analytical fields such as bioinformatics, finance, econometrics and medicine, along with many other industry needs.
So how did I get started with R? There are many online resources that teach R well, starting from the very basics. The best way to learn R is by doing, with plenty of practice. DataCamp has a very good R tutorial that combines an introduction with practice exercises. Another good free, interactive online tutorial is Try R, available on O'Reilly's Code School website. I chose Try R because it teaches the basics of R in a short time and includes very good examples of charts and maps. There are 7 badges to earn in this tutorial. They chose a pirate theme, maybe because we are all pirates of a sort, looking for the treasure hidden in the sea of big data. Below are the badges I have earned.
How I achieved the graphics
To start working with R, I first needed some data to play around with. I chose the data on fatalities on Irish roads from 2000 to 2015, because I wanted to visualize the fatality statistics and see how they changed over the last decade. With the data in hand, the next step was to import it into R. The first step in doing that is to set the working directory, the folder where you want to save all your work.
# This command is to set the working directory.
# Check that your working directory has been correctly set
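The commands for these two steps are not shown above, so here is a minimal sketch of what they might look like. It uses the base R functions setwd() and getwd(); the path is the one mentioned in this post, and the variable name work_dir and the dir.exists() guard are my own additions so that the snippet also runs on a machine where that folder does not exist.

```r
# Path taken from this post; replace it with a folder that exists on your machine.
work_dir <- "D:/Study/R"   # hypothetical variable name, used for illustration only

# Only change the working directory if the folder really exists.
if (dir.exists(work_dir)) {
  setwd(work_dir)
}

# Print the current working directory to check that it is the one you expect.
getwd()
```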
In my case the command returned “D:/Study/R”, which was correct. With the working directory set, the next step is to read the file from that directory. To make things easier I stored it in a variable, so I could use that variable later for the analysis and for plotting the graphs.
# Reading the csv file from the directory and putting it in a variable called fatalities.
fatalities <- read.csv("Fatalities.csv", sep = ",", header = TRUE)
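read.csv – reads a file in csv format into a data frame
sep – tells R which character separates the values in the file, here "," (comma)
header – tells R whether the first line of the file holds the column names. Use TRUE (or T) if there is a header and FALSE (or F) if there is not.
Note: R is case sensitive, so Death and death are different names. For read.csv, sep = "," and header = TRUE are also the default values.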
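To see these arguments in action without needing my file, here is a small sketch that writes a tiny csv to a temporary file and reads it back. The numbers are placeholders made up for illustration, not real statistics, and the names tmp_file and sample_data are hypothetical. The column names Year and Death match the ones used in the charts in this post.

```r
# Write a tiny csv to a temporary file (placeholder values, not real statistics).
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("Year,Death", "2000,10", "2001,8"), tmp_file)

# header = TRUE: the first line holds the column names.
# sep = ",": the values on each line are separated by commas.
sample_data <- read.csv(tmp_file, sep = ",", header = TRUE)

# Show the structure of the resulting data frame.
str(sample_data)

# R is case sensitive: the column is called Death, so sample_data$death is NULL.
is.null(sample_data$death)
```

If the file had no header line, header = FALSE would make R name the columns V1, V2 and so on.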
To plot the graphs and analyse the data I used ggplot2, a visualization package for R. ggplot2 is easy to understand yet powerful, and it is one of the most popular R packages. It is not built into R, so you have to install it first. The same command can be used to install any R package.
# To install any package
# To install the ggplot2 package. You only have to install a package once.
# Load the ggplot2 package in the current session. You have to load the package every time you start a new session.
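The commands behind these comments are most likely the standard install.packages() and library() calls. A minimal sketch follows; the requireNamespace() check is my own addition so that the package is only downloaded when it is missing.

```r
# General form for installing any package from CRAN:
# install.packages("package_name")

# Install ggplot2 only if it is not installed yet (needs an internet connection).
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load ggplot2 into the current session; repeat this in every new session.
library(ggplot2)
```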
So far we have set the working directory, read the file, and installed and loaded the ggplot2 package in the current session. Now it is time to visualise the data with ggplot. The first chart is a very simple line chart with a point on every value. The x-axis shows the year and the y-axis shows the number of fatalities on Irish roads in that year. Please see Chart-1.1. I produced this chart with the line of code below:
# Ireland Line Chart
ggplot(data = fatalities, aes(x=Year, y=Death, group=1)) + geom_line(size=1.5, colour="cornflowerblue") + geom_point(colour="black", size=3) + expand_limits(y=0) + ylab("Fatalities") + ggtitle("Fatalities in Ireland from 2000 – 2015")
data = the data to plot. Here we choose fatalities, the variable that holds the contents of the file.
aes() = the aesthetic mapping of variables to parts of the plot, here Year to the x-axis and Death to the y-axis.
geom_line() = adds a layer of lines
geom_point() = adds a layer of points
expand_limits() = extends the axis limits; with y=0 the y-axis is made to include zero
ylab() = labels the y-axis (xlab() can be used to label the x-axis)
ggtitle() = puts a title at the top of the graph
I then played around with the data to make the chart more visual and informative. Next, I put the values on the points, so that it is easy to read the number at every point. This only takes one extra layer that labels each point on the line. Please see Chart-1.2.
# for adding labels to the points on graph in ggplot2
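The exact line I used is not shown here, but one common way to do this in ggplot2 is a geom_text() layer. The sketch below assumes that approach. The data frame holds placeholder values for illustration only (the real data comes from Fatalities.csv), and the vjust value and the name chart_labels are my own choices.

```r
library(ggplot2)

# Placeholder values for illustration only, not the real statistics.
fatalities <- data.frame(Year = 2000:2004, Death = c(50, 45, 40, 42, 38))

# Same line chart as before, plus a text layer that prints each value.
chart_labels <- ggplot(fatalities, aes(x = Year, y = Death, group = 1)) +
  geom_line(colour = "cornflowerblue") +
  geom_point(colour = "black", size = 3) +
  # label = Death prints the value; vjust = -1 moves the text just above the point
  geom_text(aes(label = Death), vjust = -1) +
  expand_limits(y = 0) +
  ylab("Fatalities")

chart_labels
```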
I also wanted the area under the line filled with colour, which makes the chart look better. See Chart-1.3.
# Fill the area of the chart with colour, where alpha sets the transparency of the colour
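Again, the original line is not shown, so this is only a sketch of how the fill could be done, assuming a geom_area() layer drawn underneath the line. The data frame holds placeholder values for illustration only, and the fill colour, the alpha value of 0.3 and the name chart_area are my own choices.

```r
library(ggplot2)

# Placeholder values for illustration only, not the real statistics.
fatalities <- data.frame(Year = 2000:2004, Death = c(50, 45, 40, 42, 38))

# The area layer comes first so that the line and points are drawn on top of it.
chart_area <- ggplot(fatalities, aes(x = Year, y = Death, group = 1)) +
  # alpha runs from 0 (fully transparent) to 1 (fully opaque)
  geom_area(fill = "cornflowerblue", alpha = 0.3) +
  geom_line(colour = "cornflowerblue") +
  geom_point(colour = "black", size = 3) +
  ylab("Fatalities")

chart_area
```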
What can be gleaned from this data and these graphs?
I used line graphs here simply because they are the most useful for showing trends. They use lines to connect the data points on the graph, and they help to identify whether two variables relate to (or “correlate with”) one another. Where the y-axis indicates a quantity or percentage and the x-axis represents units of time, the line graph is often referred to as a time series graph. The graphs show that the number of fatalities started to drop after 2000 but went up again in 2005, to the same level as in 2000. After 2005 there was a steady drop in the number of fatalities until 2012, when it was as low as 162 for the year. There was then a small increase for 2 years, and in 2015 the number came close to the 2012 figure again. So, according to this graph, there has been a big drop in the number of fatalities on Irish roads over the last decade. Surely the credit for that goes to the RSA (Road Safety Authority) and to those behind the wheel. A drop like this is only possible when people are aware of the rules and regulations and follow them properly.
What else can be compared?
Having seen the drop in the number of fatalities in Ireland, I thought it would be nice to compare this data with a few other countries in Europe. So I got the data for Ireland, France, the UK and Germany to compare and visualize in R. It would not be fair to compare raw totals, as the other countries have far larger populations than Ireland. That is why the data I used is the number of fatalities per million of population, which gives a better view. The first chart (Chart-2.1) is just a scatter plot with black dots for all the values of every country. It does not really give a clear comparison, because it is hard to see which point represents which country. The code is
# Reading the file and assigning it to the variable
fatalities3 <- read.csv("Europe Fatalities.csv", sep = ",", header = TRUE)
# Scatter plot with black dots
ggplot(fatalities3, aes(x=factor(Year), y=Death)) + geom_point(shape=19) + ylab("Fatalities") + xlab("Year") + ggtitle("Fatalities/Million in France/Germany/Ireland/UK from 2000 – 2014")
So the next step was to give the points a different colour and shape for each country, which makes it easier to separate and understand the values, as shown in Chart-2.2. The code is
# Scatter Plot with different shapes and colour comparing all 4 countries.
ggplot(fatalities3, aes(x=factor(Year), y=Death, shape=Country, colour=Country, group=Country)) + geom_point(size=3) + ylab("Fatalities") + xlab("Year") + ggtitle("Fatalities/Million in France/Germany/Ireland/UK from 2000 – 2014")
But the data is still not that easy to read. So the next step is to draw a line for each country that connects its points, which makes the chart more informative and visually easier to read. This is why a line chart is so well suited to showing how values change over time. See Chart-2.3.
# Line Graph comparing all four countries
ggplot(fatalities3, aes(x=factor(Year), y=Death, colour=Country, group=Country)) + geom_line(size=1) + expand_limits(y=0) + geom_point(size=2) + ylab("Fatalities") + xlab("Year") + ggtitle("Fatalities/Million in France/Germany/Ireland/UK from 2000 – 2014")
Looking at the charts above, we can see that every country had a very high number of fatalities in the year 2000, with France the highest and the UK the lowest. It is good to see that the number of fatalities is dropping across these countries at a steady rate and is at its lowest level for the period shown.
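The fatalities per million figure used in this comparison is simply the number of deaths divided by the population and multiplied by one million. The file I used already contained that figure, so the sketch below is only an illustration of the arithmetic. The country names and numbers are hypothetical and are not real statistics.

```r
# Hypothetical figures for illustration only (not real statistics).
deaths     <- c(CountryA = 200, CountryB = 3000)
population <- c(CountryA = 4e6, CountryB = 60e6)

# Fatalities per million of population.
deaths_per_million <- deaths / population * 1e6
deaths_per_million
```

In this made-up example the raw totals differ by a factor of 15, yet both countries work out to 50 fatalities per million, which is why the per-million figure is the fairer basis for comparison.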
Other Ideas or concepts
There are many other ideas and concepts that could be shown in R if I had more time. Road fatalities can occur for different reasons and under different conditions at the time of the accident, such as weather, natural light, road condition, the driver's movement, the movement of other vehicles, etc. By collecting and analysing that data and looking at the trends, it would be easier to predict where an accident is more likely to happen. For example, the analysis might show that accidents are more likely when traffic is heavier, from 8 to 10 in the morning or from 4 to 6 in the evening; other factors could be the age and gender of the person, the speed of the vehicle, etc. So R is a great tool for statistical analysis, and it has great graphics tools for visualising big sets of data, which make it easy for us to see trends and make predictions. The main point is to use the right commands and to have the right, trustworthy data, which can give us the power of predicting the future.