Mining for gold in unstructured text
There's tremendous value in all the text floating around the internet, from news, tweets and blogs to all kinds of documents posted on your intranet. You can gain an edge if you can mine and structure this data, but how do you do it? In this blog post, Sven Anders lays out the most common approaches.
Most of the textual data available today exist in unstructured form, and the amount of new information posted is on a scale that beggars belief. On Facebook and Twitter alone, more than a billion pieces of information are produced and shared every day.
This information can be of great value if you can find a way to structure the content you care about and make it searchable.
The question is, how do you convert this unstructured text to structured form?
First of all, you need to figure out what you are interested in learning. You cannot simply parse every piece of textual information and expect to end up with something usable. You need to define what is relevant to you and then extract only this information from the unstructured text.
This area of research is called information extraction and is an important task for understanding natural languages programmatically and making sense of unstructured textual data.
Start with named entity extraction
Typically, you would start by using named entity extraction, which includes named entity recognition, coreference resolution, and relationship extraction.
The named entity recognition task is a set of techniques that help identify all mentions of predefined named entities in text, typically persons, organizations, and locations. Let's say you have a collection of news stories, and you want to structure this information and mine it for things of interest. First, you need to do boundary detection to find out where each relevant mention starts and ends. Once you have the boundaries, you need to classify each mention into one of the named entity types in which you are interested.
If it is a news story, the fields of interest are the people, the organizations, or the geopolitical entities. If the story mentions a white house, the capitalization tells you whether it means the presidential residence or the American government, or simply a house that is white. Likewise, if the story refers to 10 Downing Street, it usually means the office of the British Prime Minister rather than the actual address.
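To make the two steps concrete, here is a minimal sketch of boundary detection followed by classification. It assumes a tiny hand-made lookup table (the GAZETTEER dictionary and the find_entities function are hypothetical names invented for this illustration); a production system would more likely use a trained model.

```python
import re

# Hypothetical gazetteer: a tiny hand-made lookup table, not a real resource.
GAZETTEER = {
    "angela merkel": "PERSON",
    "european union": "ORGANIZATION",
    "berlin": "LOCATION",
    "new york": "LOCATION",
    "new york times": "ORGANIZATION",
}
MAX_LEN = max(len(name.split()) for name in GAZETTEER)


def find_entities(text):
    """Return (mention, label, start_token, end_token) tuples."""
    tokens = re.findall(r"\w+", text)
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        # Boundary detection: try the longest candidate span first, so that
        # "New York Times" wins over the shorter "New York".
        for length in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length])
            # Classification: look up the entity type of the candidate span.
            label = GAZETTEER.get(candidate.lower())
            if label:
                match = (candidate, label, i, i + length)
                break
        if match:
            entities.append(match)
            i = match[3]  # continue after the matched span
        else:
            i += 1
    return entities


story = ("Angela Merkel told the New York Times in Berlin "
         "that the European Union is united.")
for mention, label, start, end in find_entities(story):
    print(f"{mention}: {label} (tokens {start}-{end})")
```

Trying the longest span first is one simple way to settle where a mention ends; it is a design choice made for this sketch, not the only option.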
In a financial item, you are talking about money, monetary values, stock prices of companies and so on. In a medical story, you are talking about diseases, drugs, viruses and so on. The names can be very diverse, so the rules need to stay updated based on the current domain of the text.
Certain words have different meanings depending on the topic. If the text refers to Chicago, it could be a place, a pop group, a music album, or even a font. You need to work out which sense is meant before you know what label to assign to the word.
Once you have identified the named entities, you want to find their relations.
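One simple way to pick a label for an ambiguous word is to look at the words around it. The sketch below assumes a hand-written set of cue words per sense (SENSE_CUES and label_mention are hypothetical names invented for this illustration) and falls back to a default label when no cue is found.

```python
import re

# Hypothetical cue words for each sense of an ambiguous name such as "Chicago".
SENSE_CUES = {
    "LOCATION": {"city", "illinois", "mayor", "downtown", "flight"},
    "BAND": {"band", "concert", "tour", "song", "singer"},
    "ALBUM": {"album", "tracks", "released", "record"},
    "FONT": {"font", "typeface", "bitmap", "typography"},
}


def label_mention(sentence, default="LOCATION"):
    """Pick the sense whose cue words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue words at all, fall back to the assumed default sense.
    return best if scores[best] > 0 else default


print(label_mention("The mayor of Chicago spoke downtown."))
print(label_mention("Chicago played a concert on their summer tour."))
print(label_mention("The menu text is set in Chicago, a bitmap font."))
```

In practice the cue words would depend on the domain of the text, which is why the rules need to be kept up to date.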
Find the relations and disambiguate
Relations are what happened to whom, when, where, and so on. To get semantically meaningful results out of a piece of text, you need to identify whom it is talking about, the time or date it refers to, the place where it occurs, and the relationships between the named entities in the text.
One of the hurdles when parsing text is disambiguating mentions and grouping them if they refer to the same entity. For instance, if you have a text saying "John meets Olivia," you can easily tell that it contains two separate entities. If the next sentence uses pronouns, such as "He surprised her with a rose," then you need to resolve the pronouns so that you can group "He" and "John" together as the same entity, and likewise "her" and "Olivia." You need both named entity recognition and relation extraction to do coreference resolution so that you can group these.
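As a rough illustration, relations between two already-recognized entities can sometimes be picked up with a simple pattern. The sketch below assumes the entity list comes from an earlier recognition step and only handles the shape "entity, one or two words, entity"; ENTITIES and extract_relations are hypothetical names invented for this example.

```python
import re

# Hypothetical output of an earlier named entity recognition step.
ENTITIES = ["John", "Olivia", "Acme Corp"]


def extract_relations(sentence, entities):
    """Return (subject, relation, object) triples found in the sentence."""
    # Longest names first, so multi-word entities are matched whole.
    names = "|".join(
        re.escape(e) for e in sorted(entities, key=len, reverse=True)
    )
    # Subject entity, a relation of one or two words, then object entity.
    pattern = re.compile(rf"({names})\s+(\w+(?:\s\w+)?)\s+({names})")
    return pattern.findall(sentence)


print(extract_relations("John meets Olivia.", ENTITIES))
print(extract_relations("Olivia works for Acme Corp.", ENTITIES))
```

Real text is rarely this regular, so a pattern like this is best seen as a starting point before moving to a trained model.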
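The John and Olivia example can be sketched with a deliberately naive rule: attach each pronoun to the most recently mentioned name of the matching gender. The lookup tables and the resolve_pronouns function are assumptions made for this illustration; real coreference resolvers use far richer signals.

```python
import re

# Hypothetical lookup tables, just large enough for the example.
NAME_GENDER = {"John": "male", "Olivia": "female"}
PRONOUN_GENDER = {
    "he": "male", "him": "male", "his": "male",
    "she": "female", "her": "female", "hers": "female",
}


def resolve_pronouns(text):
    """Group each pronoun with the most recent name of matching gender."""
    clusters = {}   # entity name -> list of mentions that refer to it
    last_seen = {}  # gender -> most recently mentioned name
    for token in re.findall(r"[A-Za-z]+", text):
        if token in NAME_GENDER:
            last_seen[NAME_GENDER[token]] = token
            clusters.setdefault(token, []).append(token)
        elif token.lower() in PRONOUN_GENDER:
            antecedent = last_seen.get(PRONOUN_GENDER[token.lower()])
            if antecedent:  # skip pronouns with no candidate so far
                clusters[antecedent].append(token)
    return clusters


print(resolve_pronouns("John meets Olivia. He surprised her with a rose."))
# {'John': ['John', 'He'], 'Olivia': ['Olivia', 'her']}
```

The rule breaks down as soon as two people of the same gender appear, which is one reason coreference resolution is considered a hurdle.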
Apply the right technique
The difficulty in information extraction is choosing the right features and labels for the extraction task. The underlying technology and algorithms are relatively generic.
If you have well-defined data, for instance, fields like dates and phone numbers, you can use regular expressions to filter and parse the information. For other fields, it is common to use a machine learning approach where you set up a learning model that you then train over large datasets.
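For well-defined fields, a regular expression is often enough. The sketch below assumes two specific formats, ISO-style dates such as 2024-03-15 and ten-digit phone numbers such as 555-123-4567; other formats would need their own patterns. The names DATE_RE, PHONE_RE, and extract_fields are invented for this example.

```python
import re

# Assumed formats: ISO dates (YYYY-MM-DD) and ten-digit phone numbers
# written as 555-123-4567, 555.123.4567, or (555) 123-4567.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")


def extract_fields(text):
    """Pull dates and phone numbers out of free text into a dictionary."""
    return {
        "dates": DATE_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


message = "Call 555-123-4567 or (555) 987-6543 before 2024-03-15."
print(extract_fields(message))
# {'dates': ['2024-03-15'], 'phones': ['555-123-4567', '(555) 987-6543']}
```

The patterns only check the shape of the text, so a string like 2024-99-99 would still be accepted as a date; a stricter version could validate the values after matching.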
When building and training a model, it is beneficial and often necessary to use a supervised approach, where you can inspect how well the model performs and fine-tune it to get the results that you want.
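One common way to inspect a model is to compare its output with a hand-labeled sample and compute precision, recall, and F1. The sketch below assumes that both the predictions and the labeled sample are sets of (mention, label) pairs; the evaluate function and the sample data are invented for this illustration.

```python
def evaluate(predicted, gold):
    """Compare predicted (mention, label) pairs with hand-labeled ones."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: how many of the predictions were correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: how many of the labeled entities were found.
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical hand-labeled sample and model output.
gold = {("John", "PERSON"), ("Olivia", "PERSON"),
        ("Chicago", "LOCATION"), ("Acme", "ORGANIZATION")}
predicted = {("John", "PERSON"), ("Olivia", "PERSON"),
             ("Chicago", "ORGANIZATION")}

precision, recall, f1 = evaluate(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Tracking these numbers while you adjust features and labels shows whether a change actually improved the extraction.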
What can you do with this information?
There are lots of business cases related to extracting information from bodies of text. For instance, named entities can be indexed and linked, which is very valuable for many organizations, especially ones that do business in areas related to search. Media organizations can also benefit from information extraction when they research news stories.
Once the entities are identified, sentiment can be attributed to companies, products, and persons. You can analyze this sentiment to spot potential new upward or downward trends and so gain a valuable market edge.
Text mining is all about mining for information, and while it is certainly tricky and challenging, applying this technology to your information stream can yield tremendous value.