An Old Tool with Enduring Value: Using Excel to Prepare Data for Analysis
Smith, G. A. (2018, February). An old tool with enduring value: Using Excel to prepare data for analysis. A workshop presented at the Wrangling Library Data virtual conference.
Microsoft Excel was first released on the Windows platform 30 years ago and has since become widely used. Although new tools for manipulating, analyzing, and visualizing data are constantly emerging, Excel remains a potent tool—and not just because of newer features. Simple functions such as TRIM, MID, SUBSTITUTE, FIND, ROUNDDOWN, and VLOOKUP can be used to manipulate data sets in powerful ways. This workshop applies selected functions to realistic library data sets. Demonstrations include: deriving time-series categories from date and time stamps pre-coding survey comments based on keywords dealing with messy data points such as call numbers and publisher names removing extraneous punctuation and symbols from textual data identifying keywords that are misspelled or on a stopword list Spreadsheets available for download provide simple utilities for performing these tasks. For those interested in creating or modifying formulas, presentation slides illustrate the basics of using selected functions to transform data in Excel.
The premises of this conference presentation were pretty simple:
- Data sets need to be prepared for analysis. The data we’re called on to analyze typically need stuff done to them. Particularized data need to be grouped into meaningful categories. Free text needs to be broken down into keywords. Dirty data need to be cleansed and other irregularities need to be dealt with. In sum, signals need to be distinguished from noise.
- Data manipulation tools more potent than Microsoft Excel are available, but I find that …
- I’m already using Excel frequently.
- I’m often far from exhausting its capabilities.
- It’s easier for me to learn new tricks in a familiar environment than in an unfamiliar one.
With those thoughts in mind, I set out to share some of the data manipulation tricks that I’ve learned to use, mostly with library-oriented data, over the years. Specifically, I showed how I’ve used Excel formulas to perform the following tasks:
- deriving time-series categories from date and time stamps
- pre-coding survey comments based on keywords
- dealing with messy data points such as call numbers and publisher names
- removing extraneous punctuation and symbols from textual data
- identifying keywords that are misspelled or on a stopword list
I offered presentation viewers eight utility files—that is, spreadsheets where one can paste a column of data and see it transformed into something more conducive to insightful analysis. The files that I demonstrated, along with my slide deck, can be downloaded via the link above.