Data visualization with ggplot2 in R

ggplot2 is a widely used R package for data visualization. It provides many commands for building different kinds of plots, and it is easy to adjust when something needs to change. For example, if we change the dataset, say by adding more data points to the file, many other data visualization tools make us rebuild the chart because the dataset has changed. In R, we can simply load the new dataset under the same name and rerun the same code. Learning how to build graphs in a programming language is therefore especially useful in this situation. This tutorial walks through some useful ggplot2 commands and aims to offer another way to do data visualization.

First, we need to install the ggplot2 package in RStudio so we can use it. However, installing a package does not make it available in the current session, so we also need to load it with the library() command. To visualize data, we also need to import it. Here, we use the penguins data from the palmerpenguins package.

code for using the package and importing dataset
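As a rough sketch of this setup step, the code might look like the following. It assumes both packages come from CRAN, and the install line is commented out because it only needs to run once.

```r
# Install once per machine (uncomment if the packages are not installed yet):
# install.packages(c("ggplot2", "palmerpenguins"))

# Load the packages for the current session
library(ggplot2)
library(palmerpenguins)

# Load the penguins data frame that ships with palmerpenguins
data("penguins", package = "palmerpenguins")

# Peek at the first rows to check the column names
head(penguins)
```

The columns used in the rest of this tutorial are bill_length_mm, body_mass_g, and species.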

The most basic command in ggplot2, the one that creates the plot, is ggplot(). Here we supply the data and the variables for the x and y axes. In this example, we use the penguins’ bill length (variable name bill_length_mm) on the x axis and body mass (variable name body_mass_g) on the y axis. The command is shown below: we put the data name first, then give the x variable’s column name and the y variable’s column name. Because any pair of columns can be plugged in this way, we are free to try different analyses on the same data.

base plot
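A minimal sketch of the base plot, assuming the columns are mapped inside aes() and the result is stored in a variable (base_plot is just an illustrative name):

```r
library(ggplot2)
library(palmerpenguins)

# Data first, then the x and y columns inside aes()
base_plot <- ggplot(data = penguins,
                    mapping = aes(x = bill_length_mm, y = body_mass_g))

# Printing the object draws the axes only; no points have been added yet
base_plot
```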

Both variables are quantitative, meaning they are meaningful numbers, so one way to show their relationship is a scatter plot. To draw the scatter plot, we use the geom_point() command to add the points. Since we already have the base plot, we can add this command after the base command and connect the two with “+”. ggplot2 then draws the layers in the order they were added.

plotting body mass against bill length of penguins
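The scatter plot could be written roughly like this (scatter_plot is an illustrative variable name, not something from the original screenshot):

```r
library(ggplot2)
library(palmerpenguins)

# Base plot plus a layer of points, connected with "+"
scatter_plot <- ggplot(data = penguins,
                       mapping = aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point()

scatter_plot
```

When the plot is drawn, ggplot2 warns that rows with missing measurements were removed; that is expected with this dataset.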

We can also make adjustments to the points. For example, the data contains three penguin species, and we can tell them apart by giving each species a different color.

coloring the dots with different types of penguins
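One way to do this, sketched under the assumption that the color is mapped to the species column inside aes() (colored_plot is an illustrative name):

```r
library(ggplot2)
library(palmerpenguins)

# Mapping color to species gives each species its own color and adds a legend
colored_plot <- ggplot(data = penguins,
                       mapping = aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species))

colored_plot
```

Putting color inside aes() is what ties it to a column; a fixed color for all points would go outside aes() instead, for example geom_point(color = "blue").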

We can also visualize the data with other kinds of plots. For example, a histogram is a good way to look at the distribution of a quantitative variable. We can use the geom_histogram() command with its mapping argument to tell R which column to plot. Here, we are curious about the distribution of bill length, so we put it on the x axis.

histogram with ggplot2
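A sketch of the histogram follows. The binwidth of 1 mm is my own illustrative choice rather than something from the original; if it is left out, ggplot2 picks a default number of bins and prints a message suggesting you choose one.

```r
library(ggplot2)
library(palmerpenguins)

# Only an x variable is needed; the y axis is the count in each bin
hist_plot <- ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = bill_length_mm),
                 binwidth = 1)  # assumed bin width of 1 mm, adjust as needed

hist_plot
```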

We can also rename the x-axis and y-axis if we want the labels to be more specific. Some other data visualization tools, which generate graphs from the selected columns and reuse the column names as labels, cannot do this. We just add another command, xlab(), connected with “+”, to set the label of the x-axis. The text needs to be enclosed in quotation marks, or else R will read it as code instead of text.

add x-axis title with xlab() command
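For instance, assuming we are relabeling the histogram of bill length (the label text and the variable name labeled_plot are illustrative):

```r
library(ggplot2)
library(palmerpenguins)

# xlab() replaces the default label, which would be the column name
labeled_plot <- ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = bill_length_mm), binwidth = 1) +
  xlab("Bill length (mm)")

labeled_plot
```

The y-axis can be renamed the same way with ylab().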

In addition, we can add a title to the graph with the ggtitle() command. The title also needs quotation marks so that it is recognized as a string.

add title for the graph
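Continuing the same sketch, a title could be added like this (the title text and the name titled_plot are made up for illustration):

```r
library(ggplot2)
library(palmerpenguins)

# ggtitle() adds a title above the plot; it chains with "+" like the other commands
titled_plot <- ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = bill_length_mm), binwidth = 1) +
  xlab("Bill length (mm)") +
  ggtitle("Distribution of penguin bill length")

titled_plot
```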

There are many more useful commands in ggplot2, which are covered in other tutorials and in the official documentation. Unlike ready-made data visualization tools, ggplot2 gives a more flexible and customizable environment for users who have demanding and specific requirements for their output. It also works well with other data processing packages in R, so users can tidy, clean, and visualize raw data with a single tool. If several rows of data are missed while cleaning in OpenRefine, and we then have to repeat the whole visualization process in a separate visualization tool, a lot of time is wasted. Because R is a programming language, it is much easier to change the steps that come before the visualization and rerun everything, which is one of the reasons I recommend knowing some ggplot2 and R commands when working with data.

Nina Sun

