Cleaning Data with [r]egex

An important part of working with data is the ability to manipulate it into something useful, whether that means cleaning it up or reshaping a table so it is easier to work with. Cleaning data is usually the first step in any data science project, and it can be where you spend most of your time. How else can you get all the variables you need to create a world-class algorithm if you can’t even plug in the right stuff? This is the first part of a two-part series on how to clean data. I’ll cover the basics of regular expressions, or REGEX for short. For all those wishing to nitpick and demand I pronounce it REGEXP, you can please exit left now.

REGEX is a sequence of characters that defines a search pattern. The simplest patterns are literals, which are just the exact words or numbers you want to find. Alternation uses the pipe symbol (|) to match either the pattern on its left or the one on its right, much like an ‘or’ statement.
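Here is a minimal sketch of a literal and an alternation. The post doesn’t name a language, so Python and its built-in re module are my assumption, and the sample strings are made up for illustration.

```python
import re

# Literal: the pattern is just the text to look for.
literal_hit = re.search(r"cat", "concatenate")
print(literal_hit.group())  # cat

# Alternation: match whatever is on either side of the pipe.
pets = re.findall(r"cat|dog", "cat, dog, bird, dog")
print(pets)  # ['cat', 'dog', 'dog']
```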

Character sets [] let us list possibilities in a regex. Place inside them the letters or numbers you wish to allow, and the set will match exactly one character from that list. E.g. col[ou]r will match either ‘color’ or ‘colur’, but not ‘colour’, because the set stands for a single character. You can also use ranges within character sets by adding a dash, so that [a-c]at will match ‘aat’, ‘bat’, or ‘cat’.
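A quick sketch of character sets and ranges, assuming Python’s re module. The word list and the gr[ae]y pattern are my own examples, not part of any real dataset.

```python
import re

words = ["aat", "bat", "cat", "dat", "gray", "grey", "groy"]

# A range inside a set: one character from a to c, then "at".
range_hits = [w for w in words if re.fullmatch(r"[a-c]at", w)]
print(range_hits)  # ['aat', 'bat', 'cat']

# A plain set: exactly one of the listed characters in that position.
set_hits = [w for w in words if re.fullmatch(r"gr[ae]y", w)]
print(set_hits)  # ['gray', 'grey']
```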

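If there are characters you’d like to exclude from your search, insert the caret symbol ^ at the start of a character set. E.g. [^a-k]ow will match any single character outside the range ‘a’ to ‘k’ followed by ‘ow’, like ‘low’, ‘mow’, ‘now’, ‘oow’, ‘pow’, etc. Keep in mind that this also lets through capital letters, digits, and symbols, since they are not in the range a-k either.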
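To see a negated set in action, here is a small sketch assuming Python’s re module. The word list is invented, and I threw in ‘9ow’ and ‘Cow’ to show that anything outside a-k qualifies, not only later letters.

```python
import re

words = ["cow", "low", "mow", "now", "9ow", "Cow"]

# [^a-k] accepts any single character that is NOT a lowercase a through k.
negated_hits = [w for w in words if re.fullmatch(r"[^a-k]ow", w)]
print(negated_hits)  # ['low', 'mow', 'now', '9ow', 'Cow']
```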

Wildcards are placeholders that can stand for any single character (in most regex flavors, anything except a line break), and are denoted by a period. They act like a range that also includes numbers and symbols. Since sometimes you could be searching for an actual period, it’s important to remember the escape character, the backslash (\). E.g. ...\. will match any three characters followed by a period: ‘cat.’, ‘dow.’, ‘now.’, and so on.
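The difference between a wildcard period and an escaped period is easy to check. This is a sketch assuming Python’s re module, with sample text of my own.

```python
import re

# Three wildcards followed by an escaped (literal) period.
chunks = re.findall(r"...\.", "cat. dow. now.")
print(chunks)  # ['cat.', 'dow.', 'now.']

# Unescaped, the period matches any character, so "a-c" slips through.
loose = bool(re.fullmatch(r"a.c", "a-c"))
# Escaped, only a real period will do.
strict = bool(re.fullmatch(r"a\.c", "a-c"))
print(loose, strict)  # True False
```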

Woof, that’s quite a bit of REGEX to type if you’re looking for longer keywords. Thank goodness for the shorthand character classes.

\w: can be used in place of [A-Za-z0-9_].

\d: for [0-9].

\s: for [ \t\r\n\f\v] (space, tab, carriage return, line break, form feed, vertical tab).

To search for the negated version of a class, simply use its capitalized form, such as ‘\D’ for anything that is not a digit.
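The shorthand classes are where regex starts paying off for data cleaning. Below is a small sketch assuming Python’s re module. The phone number and other strings are invented. One caveat worth knowing: with ordinary Python strings, \w, \d, and \s also match non-ASCII word characters, digits, and whitespace unless you pass the re.ASCII flag.

```python
import re

# \D is "anything that is not a digit": strip it out to keep the digits.
digits_only = re.sub(r"\D", "", "(555) 123-4567")
print(digits_only)  # 5551234567

# \d+ pulls out each run of digits.
numbers = re.findall(r"\d+", "Order 66 shipped on 2021-03-04")
print(numbers)  # ['66', '2021', '03', '04']

# \s+ collapses any run of whitespace into a single space.
tidy = re.sub(r"\s+", " ", "too   many\t spaces\n").strip()
print(tidy)  # too many spaces

# \w+ grabs word characters, underscore included but not the hyphen.
tokens = re.findall(r"\w+", "snake_case, kebab-case")
print(tokens)  # ['snake_case', 'kebab', 'case']
```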

Grouping is done with parentheses (). Earlier I used the pipe symbol | for ‘or’ statements, but it’s important to remember that the pipe splits the entire pattern in two unless you limit it with a group. E.g. “I love ice-cream|cake” will match either “I love ice-cream” or just “cake”. “I love (ice-cream|cake)”, meanwhile, will match either full sentence.
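The ice-cream example can be checked directly. This is a sketch assuming Python’s re module, using fullmatch so that each pattern has to account for the whole string.

```python
import re

ungrouped = r"I love ice-cream|cake"
grouped = r"I love (ice-cream|cake)"

# Without a group, the pipe splits the whole pattern in two.
ungrouped_sentence = bool(re.fullmatch(ungrouped, "I love cake"))
ungrouped_word = bool(re.fullmatch(ungrouped, "cake"))
print(ungrouped_sentence, ungrouped_word)  # False True

# With a group, only the dessert is the either/or part.
grouped_sentence = bool(re.fullmatch(grouped, "I love cake"))
grouped_word = bool(re.fullmatch(grouped, "cake"))
print(grouped_sentence, grouped_word)  # True False
```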

Fixed quantifiers {} give us a number range to work with when repeating a character. The quantifier applies only to the character (or group) right before it. E.g. wow{2,4} will match ‘wo’ followed by two to four ‘w’: woww, wowww, or wowwww. You could be more specific and just use wow{4} for exactly four trailing ‘w’. It’s important to remember that quantifiers are greedy: when several lengths are possible, they match as many ‘w’ as they can.
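Here is a sketch of the wow{2,4} example and of greedy matching, assuming Python’s re module. The last line goes slightly beyond the text above: adding a ? after the quantifier makes it lazy, which I include only for contrast.

```python
import re

candidates = ["wow", "woww", "wowww", "wowwww", "wowwwww"]

# "wo" followed by two to four w's, and nothing else.
in_range = [s for s in candidates if re.fullmatch(r"wow{2,4}", s)]
print(in_range)  # ['woww', 'wowww', 'wowwww']

# Greedy: given five trailing w's, the quantifier takes the maximum of four.
greedy = re.search(r"wow{2,4}", "wowwwww").group()
print(greedy)  # wowwww

# Lazy (for contrast): {2,4}? stops at the minimum of two.
lazy = re.search(r"wow{2,4}?", "wowwwww").group()
print(lazy)  # woww
```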

The optional quantifier ? makes the character it follows optional. For example, colou?r matches both ‘color’ and ‘colour’. It’s handier with groupings such as: The monkey ate a (rotten )?banana. Note that the space sits inside the group so that it becomes optional too. This will match either:

The monkey ate a banana.

or

The monkey ate a rotten banana.
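A sketch of the optional quantifier, assuming Python’s re module. I put the trailing space inside the group, as in (rotten )?, so that the sentence without ‘rotten’ doesn’t require a double space, and I escaped the final period so it is a literal one. The third sentence is my own counterexample.

```python
import re

pattern = r"The monkey ate a (rotten )?banana\."
sentences = [
    "The monkey ate a banana.",
    "The monkey ate a rotten banana.",
    "The monkey ate a green banana.",
]

# The whole group "rotten " may appear once or not at all.
results = [bool(re.fullmatch(pattern, s)) for s in sentences]
print(results)  # [True, True, False]

# On a single character: the "u" is optional.
spellings = [w for w in ["color", "colour", "colr"] if re.fullmatch(r"colou?r", w)]
print(spellings)  # ['color', 'colour']
```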

The multiple quantifier * is a lot like the optional quantifier, except it allows anywhere from zero to many repetitions of the character. E.g. meo*w will match: mew, meow, meoow, meooow, etc. However, if you want to require at least one ‘o’, use the + symbol instead.
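The difference between * and + shows up on the zero-repetition case. A minimal sketch, assuming Python’s re module:

```python
import re

sounds = ["mew", "meow", "meoow", "meooow"]

# * allows zero or more o's, so "mew" is included.
star_hits = [s for s in sounds if re.fullmatch(r"meo*w", s)]
print(star_hits)  # ['mew', 'meow', 'meoow', 'meooow']

# + requires at least one o, so "mew" drops out.
plus_hits = [s for s in sounds if re.fullmatch(r"meo+w", s)]
print(plus_hits)  # ['meow', 'meoow', 'meooow']
```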

The last REGEX features I’d like to tell you about are the anchors ^ and $. The ^ anchors a pattern to the beginning of a string and $ anchors it to the end. Note that this ^ sits outside square brackets; inside them it means exclusion instead. E.g. ^my name is Abel$ will not match ‘Hello my name is Abel Garrido’. Anchors are for very specific searches and can be very useful when tracking down certain bits of code.
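Anchors are easiest to understand by comparing the same pattern with and without them. A sketch assuming Python’s re module:

```python
import re

anchored = r"^my name is Abel$"
unanchored = r"my name is Abel"

# The anchored pattern must cover the string from start to end.
exact = bool(re.search(anchored, "my name is Abel"))
padded = bool(re.search(anchored, "Hello my name is Abel Garrido"))
print(exact, padded)  # True False

# Without anchors, the pattern can be found anywhere inside the string.
anywhere = bool(re.search(unanchored, "Hello my name is Abel Garrido"))
print(anywhere)  # True
```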

That’s most of the regex a beginner would need to know. You can see how this can be handy in a coder’s life. When handling large sets of data it’s important to be able to find what you need. Otherwise you’re blindly digging for that needle in the haystack, and we just don’t have time for that. Work smarter, not harder, fellow coders.

I’m a web developer, and data scientist by hobby. Yes, it can be a hobby. I blog about all things code.