Tutorials > Open-refine

Data Exploration with OpenRefine

OpenRefine is a power tool for cleaning data. In this tutorial, we learn basic operations for exploring the data in Refine.

The OpenRefine homepage.

Download it here. This tutorial uses the 2.6 beta, but the mechanics are largely the same as they are in 2.5.

The dataset for this tutorial will be the Federal Election Commission's list of Iowa individual contributors to the 2008 U.S. Presidential Race, which you can find as a zipped file here.

About starting the OpenRefine application

One of the nicest things about OpenRefine is that it works from your web browser, which makes it about as intuitive as learning to use Google Spreadsheets. However, it is not an Internet application, in the sense that you don't have to be connected to the Internet to use it.

When you open the OpenRefine application, you're starting a background server process (a black Terminal window may pop up for the briefest moment). It won't even seem that anything has been opened because, well, OpenRefine is in the background.

However, if everything is working as expected, you should be able to connect your browser to the address http://127.0.0.1:3333/

The address of 127.0.0.1 is actually your own computer. And that :3333 specifies that you're connecting to port 3333 of your computer, which is what the OpenRefine app is currently running from.

Creating a new project

The dataset for this tutorial will be the Federal Election Commission's list of Iowa individual contributors to the 2008 U.S. Presidential Race, which you can find as a zipped file here.

Upon opening Refine, we are presented with a minimal homepage for managing Refine projects. These projects are stored in a local directory on your computer (again, Refine doesn't need to touch the Internet), and every project you create can be re-opened from here. You can also import Refine projects from other computers.

  1. Select the Create Project
  2. At the "Get data from" prompt, choose This Computer
  3. Then click the Choose Files button

    img

  4. The next screen will let you preview how Refine will parse the file. If you're using the Iowa FEC contributors CSV file, you everything should be parsed as expected and you can click the Create Project » button in the top-right of the screen:

    img

  5. After a bit of processing, you'll be taken to data, ready to explore and clean:

    img

Exploring the data panel

The screen is divided into two panels. On the left is the Facet/Filter panel, which we'll cover in the next section. The wide panel on the right contains a view of the imported data, similar to a spreadsheet:

img

One thing should stick out to you: Only 10 rows at a time are shown. This can be changed to show up to 50 rows, but you still have to click through the pagination to all the pages. So, in other words, exploring the data in Refine is not as easy or effortless as it is in a spreadsheet.

That may be a disadvantage, but we'll see soon that scrolling through lots of data rows is not the intended use case for Refine (or for the database GUIs we've so far used).

Editing cells

Changing the value of a cell is pretty straight-forward. Just hover the mouse over the cell until you see the word Edit pop-up. Click on that, and a new dialog box will appear:

img

Refine offers us two choices of alteration: Altering just the value of that single cell, or, Apply to All Identical Cells. In this case, if we wanted to change UNITED STATES ARMY to U.S. ARMY and selected Apply to All Identical Cells, it would be as if we did a Find and Replace All action for the UNITED STATES ARMY value inside the contrib_employer column. Handy!

Column headers

img

The dropdown menus, which can be accessed by clicking the little down arrow for each column header, is the quickest way to access Refine's most commonly used powerful features. You can also do things such as rename/remove/move columns, but Refine really isn't about rearranging your data columns.

For the purposes of this tutorial (and the next one, about clustering data), we care about the Facet submenu and the Text filter command.

The Facet and Filter panel

Let's now turn our attention to the left-side panel, which I'll refer to as the Facet/Filter panel. Currently it's empty. So in the data panel, click on the column dropdown for contrib_city and select Text filter:

img

Selecting that will pop open a new subpanel on the Facet/Filter panel:

img

Start typing in a term, such as CEDAR. The data panel will dynamically update. Also, note that in the top-left of the data_panel, the number of active rows will change. The dataset has 28,553 rows. But when you filter the contrib_city by CEDAR, the data panel will show only 2,010 matching rows:

img

Combining text filters

Let's add another Text Filter for the column of contbr_employer. Then type in something like ROCKWELL. The data panel will now show just 91 matching rows, in other words, only 91 rows have a city with the word CEDAR in it and a contrib_emplyer with the word ROCKWELL in it.

img

Clearing the filter subpanels

If you erase the filter terms, the data panel will update to show the matching rows. You can also click the little X box to remove the filter panels. Removing both panels will result in the data panel showing all 28,553 rows again.

It's important to note that this filtering is not destructive. Rows don't get deleted in Refine unless you explicitly delete them – and in general, you should never delete data in Refine…it's always better just to hide/show rows as needed, because undeleting things is a painful process.

Text filters in Refine can be thought of as doing a WHERE statement in SQL with multiple AND conditions, though for small datasets, it's certainly much faster to use Refine's interactive GUI. Sometimes I'll use Refine solely for quick filtering of datasets (note: it will be a bit clunky for datasets of a million rows and larger).

Text facet

Text filters is fun, but where Refine really shines is with its faceting capabilities. In the data panel, click the dropdown for contrib_city, then select Facet > Text facet

img

A new subpanel will show up on the left that contains a list of all unique terms that exist in the city column, along with how many occurrences there are. In SQL, this would be like doing the following aggregation query:

SELECT contrib_city, COUNT(*)
FROM contributions
GROUP BY contrib_city
ORDER BY contrib_city

The panel can be sorted by count of the terms, and it will conveniently display the number of unique choices (617). Already we can see how messy that contributions data is: 43RD ST NE is probably not the name of a city in Iowa:

img

Click on any one of the facets; the data panel will, as it did with text filters, immediately refresh to show all rows that have the selected city. You can also include multiple cities; in effect, this is another way of doing a text filter on multiple cities:

img

Multiple text facets

Click the dropdown for contbr_employer and add a Text facet. Just as with filters, we can combine numerous text facets. If you didn't reset the facet for contrib_city, then the facet count for contbr_employer will reflect only the facets that appear in the matching rows.

So in the example below, where I've selected IOWA CITY and WEST DES MOINES as the active city facets, there are 463 different employer names:

img

Conclusion

There's a lot more to Refine, but at least we know now how to quickly browse and explore the data. In the lesson on clustering, we'll see how Refine uses powerful text-manipulation algorithms to help us clean up the messiness of hand-entered data.

Other resources