When working with data, it is common to come across the SQL and R programming languages when researching common reporting and manipulation tasks. In this article, I will summarise the differences and cite which situations would call for which programming language to use, along with a brief history of R and SQL. I will also add some common usage examples of both so that you can get an idea of how basic tasks are performed in each language.
What is SQL?
In short, SQL (Structured Query Language) is a standardised language used to query and command databases. For this article, you can think of databases as a collection of tables that contain data and can be optionally linked to each other. Databases do more than this, and they all do this in slightly different ways, but that is out of the scope of this article. SQL gives the user the ability to read and modify the data in these tables.
What is R?
According to r-project.org, R is both a programming language and an environment for statistical computing and graphics. Created by statisticians, R is one of the most commonly used programming languages in data mining, but it also offers a lot of functionality for statisticians, data analysis and bioinformaticians. R is an open-source implementation of the S programming language. R has graphical facilities that can be used to create visual plots/graphs of data.
Should I use SQL or R for data analysis?
Basic data querying can be handled with SQL statements, and up to a certain degree, you could handle most analytics with SQL. But once the analysis becomes more complicated, you will start to see how the syntax of SQL will become more difficult to work with and how R can breeze through these situations.
It is often best to think of SQL as the way to do basic operations on your dataset. For example, reading, writing, updating and deleting data operations should be done through SQL. Analysing the resulting datasets should be done through R. These functions are what both languages were created for and what they do well.
Is R Faster Than SQL?
As SQL and R can do some of the same tasks, but overall, they do a lot of different work, you can’t compare the speed of these two languages against each other. You can use packages in R that will allow you to run SQL commands from within R (like in most other programming languages). So the real answer isn’t whether R is faster than SQL. It is why not use both?
Is SQL harder to learn than R?
The statements and keywords that SQL uses are closer to English than the commands in R. This means that initially learning SQL seems to make more sense to beginners, even though the meaning behind seemingly simple SQL terms might differ from what you expect.
For example, the difference between the keywords RESET and RESTART isn’t immediately obvious from an English language point of view, so users can trip themselves up if they aren’t positive about what these keywords will do.
Another component that adds to the difficulty of learning R is the need for more community/documentation available online. This is not to say that the community or documentation is lacking. It highlights how much information there is for SQL and the related database technologies that implement SQL to the standard definition.
Working with data in R
Here is an example of generating a scatter plot using R.
First, you need to read your data into the R environment like so:
data <- read.csv("/Users/brnie/Downloads/acs-jersey-city-demographic-characteristics-2013-1.csv", TRUE, ",")
Now we have the data available in R. We can look at the start of the dataset:
This will display the first few rows of the dataset.
Finally, we can generate the scatter plot with the plot function:
plot(data$year, data$value, ylab="Value", xlab="Year")
As you can see in the chart above, the data contains entries from 2008-2014. We can test this in R using the min and max functions.
We can also calculate the mean of the Value column using the mean() function.
Working with data in SQL
Using the same data set loaded into a PostgreSQL database, I will work through the same examples performed in R (excluding the chart, as SQL isn’t made for graphical output).
First, I will determine the minimum year and then the maximum year from the year column.
select max(jersey.year) from jersey;
select min(jersey.year) from jersey;
Finally, I will calculate the mean of the Value column. To do this in SQL, you use the AVG function.
select avg(jersey.year) from jersey;
SQL and R are both programming languages that have some overlap in how they work with data. SQL was designed with housing data, building data relationships, and allowing modifications and queries to retrieve data sets in mind. R was built on the foundation of performing data analytics and quick charting of data. Both languages can be used together or alone, depending on the situation.