From Large to Small: Transforming Big Data into Usable Topics (Web Exclusive)


Much of the mountain of data the Navy receives is never looked at by human eyes. How do we get the vital data we need and analyze it efficiently? (Photo by MC3 Josue L. Escobosa)

By Dr. Lawrence Carin

The Navy is increasingly challenged by the massive quantity of data it measures and analyzes. Measurement is daunting by itself. Conventional cameras, for example, measure only three principal wavelengths or colors (red, green, and blue), but hyperspectral cameras can measure images at hundreds of different wavelengths, with a separate image measured for each one. Analyzing information such as this is, if anything, even more challenging than collecting it. Once high-quantity data are measured, the process of analyzing them is very time consuming. As a result, a large fraction of Navy data is never looked at by anyone. It is therefore desirable to develop new mathematical and statistical methods that efficiently and accurately analyze massive quantities of data and summarize them in a way that is understandable to decision makers and warfighters.

Both of these challenges—high data volume and an analysis bottleneck—are being addressed by a new class of mathematics and statistics being developed by the Office of Naval Research (ONR). While the data are inherently high-dimensional, they typically may be well represented by models with a small (low-dimensional) number of parameters. For instance, a year’s worth of newspapers is a high-dimensional set of data (potentially thousands of pages of information), yet we break that information down into component (low-dimensional) categories such as the sports, life, local, or world news sections. With the raw data collected by the Navy, however, the low-dimensional model is typically unknown ahead of time and must be inferred from the data themselves. This concept of low-dimensional data structure is at the heart of modern compression techniques, which make the Internet, mobile phones, and satellite-based communications and transmission possible. Most natural data (e.g., images, video, and audio) typically may be represented as weighted sums of basis functions, the fundamental elements from which all signals of interest may be built. The key insight of the past several decades is that, for most natural signals, only a small number of basis functions are needed to represent the data accurately.
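To make the idea concrete, the short sketch below (a minimal illustration, not taken from the article; the test signal and the number of retained coefficients are arbitrary choices) expands a smooth signal in a standard basis, the discrete cosine transform, keeps only a handful of the largest coefficients, and still reconstructs the signal almost exactly.

```python
# Minimal sketch: a smooth signal represented by only a few basis coefficients.
# The signal and the number of retained coefficients are illustrative assumptions.
import numpy as np
from scipy.fft import dct, idct

n = 512
t = np.linspace(0, 1, n)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

coeffs = dct(signal, norm="ortho")            # expand the signal in the DCT basis
k = 20                                        # keep only the 20 largest coefficients
keep = np.argsort(np.abs(coeffs))[-k:]
sparse_coeffs = np.zeros_like(coeffs)
sparse_coeffs[keep] = coeffs[keep]

reconstruction = idct(sparse_coeffs, norm="ortho")
error = np.linalg.norm(signal - reconstruction) / np.linalg.norm(signal)
print(f"Relative error using {k} of {n} coefficients: {error:.4f}")
```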

One of the first forms of basis functions to be developed by ONR is called a “wavelet,” which is a small wave-like signal characterized by a series of scales or resolutions. These basis functions, and others like them, are used in widely employed compression standards. Wavelets may be applied to almost all natural data (images, audio, video, radar and sonar signals, etc.), and this generality makes them valuable tools for compression. By building basis functions that are specifically designed for certain classes of data, however, even more compression is possible. Recently, ONR has funded mathematics, statistics, and machine learning projects that have developed a new class of basis functions that are tailored to the specifics of particular types of signals (this field is called dictionary learning). In the context of document analysis, for example, these dictionary elements correspond to “topics,” and each document of interest may be summarized in terms of a compact set of topics. This is an example of the aforementioned type of low-dimensional representation: a document, which is a large set of words, may be summarized compactly in terms of a small collection of topics. The topics are tailored to the particular class of document of interest, providing a means of summarizing a massive collection of documents in terms of a relatively small set of topics.
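As a rough illustration of dictionary learning, the sketch below applies scikit-learn’s MiniBatchDictionaryLearning to the library’s built-in handwritten-digit images, a stand-in for the tailored ONR-developed methods described above; the number of atoms and the sparsity level are illustrative assumptions. Each image is then coded using only a few learned atoms, the image analogue of describing a document with a few topics.

```python
# Sketch of dictionary learning: atoms tailored to a particular class of data
# (here, 8x8 digit images), with each image coded by only a few atoms.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import MiniBatchDictionaryLearning

images = load_digits().data.astype(float)     # 1,797 images, each 64 pixels
images -= images.mean(axis=0)                 # center the data

learner = MiniBatchDictionaryLearning(
    n_components=36,                          # 36 learned dictionary atoms (assumed)
    transform_algorithm="omp",                # sparse coding via matching pursuit
    transform_n_nonzero_coefs=5,              # each image uses at most 5 atoms
    random_state=0,
)
codes = learner.fit_transform(images)         # sparse codes, shape (1797, 36)
dictionary = learner.components_              # learned atoms, shape (36, 64)

print("Average nonzero atoms per image:", (codes != 0).sum(axis=1).mean())
```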

The mathematics of low-dimensional representations, developed for compression and for data analysis and summarization, has recently been used to revolutionize the manner in which data are measured in the first place. Conventional measurement systems typically collect data one pixel at a time. This concept of measurement, which has guided sensor development for more than 100 years, is based on the idea that what comes out of the sensor should be immediately interpretable by a human. A new field called “compressive sensing” exploits the low-dimensional structure of most natural signals and dictates that the optimal measurement is no longer performed one pixel at a time. Consequently, what comes out of the sensor no longer looks like something a human typically analyzes. The advantage of performing measurements in this manner is that the quantity of data that needs to be measured is much smaller than that required by conventional pixel-based systems. For a human to examine the data, a decompression step is required, based on mathematics developed by ONR. Compressive sensing thus reduces the quantity of data that needs to be measured in the first place, and it leverages modern computing and sophisticated mathematics to decompress quickly and accurately.
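A toy version of this measurement-and-decompression loop is sketched below, assuming a signal that is sparse in the canonical basis: the signal is observed through a small number of random projections rather than pixel by pixel, and a standard sparse solver (orthogonal matching pursuit) performs the decompression. The dimensions and solver are illustrative choices, not the specific ONR-developed system.

```python
# Toy compressive sensing: 64 random measurements recover a length-256 sparse signal.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n, m, k = 256, 64, 8                          # signal length, measurements, sparsity

x = np.zeros(n)                               # the unknown sparse signal
support = rng.choice(n, size=k, replace=False)
x[support] = rng.normal(size=k)

Phi = rng.normal(size=(m, n)) / np.sqrt(m)    # random (non-pixel) measurement matrix
y = Phi @ x                                   # only 64 compressive measurements

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
omp.fit(Phi, y)                               # the "decompression" step
x_hat = omp.coef_

print("Relative recovery error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```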

The same methods that make compressive sensing possible (low-dimensional mathematical representations of natural data) also are used to address the Navy’s analysis challenge. Methods such as dictionary learning and topic modeling are now being used to analyze massive quantities of data. Parallel computing platforms and the cloud make these computations particularly efficient and accurate.

At the heart of efficient measurement and analysis of massive-scale data, as well as of modern compression technology, is the concept that data typically may be represented compactly in terms of a relatively small number of basis functions (or dictionary elements/topics). If one has multiple types of data that characterize the same system or environment (e.g., text, audio, and imagery from a given scene), these multiple types of data may be viewed as different but related representations of the same underlying scene. Consequently, each of these data types, despite being quite different in detail, may be analyzed jointly, allowing one to learn a shared latent structure. This shared low-dimensional representation provides a compact summary of massive data, aiding human understanding and analysis.
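One simple stand-in for this kind of joint analysis is canonical correlation analysis, sketched below on synthetic data: two very different feature sets describing the same scenes are projected into a shared low-dimensional space. The feature dimensions and noise levels are illustrative assumptions, and the richer ONR-developed joint models go well beyond this.

```python
# Sketch of learning a shared latent structure from two views of the same scenes.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_scenes, latent_dim = 500, 3

shared = rng.normal(size=(n_scenes, latent_dim))   # hidden structure of each scene
image_feats = shared @ rng.normal(size=(latent_dim, 40)) + 0.1 * rng.normal(size=(n_scenes, 40))
text_feats = shared @ rng.normal(size=(latent_dim, 25)) + 0.1 * rng.normal(size=(n_scenes, 25))

cca = CCA(n_components=latent_dim)
image_latent, text_latent = cca.fit_transform(image_feats, text_feats)

# The two learned projections should agree across views for each scene.
for c in range(latent_dim):
    corr = np.corrcoef(image_latent[:, c], text_latent[:, c])[0, 1]
    print(f"Component {c}: correlation between views = {corr:.3f}")
```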

To give a sense of how low-dimensional mathematical structure may be employed to infer meaning from large-scale data, it’s helpful to look at a couple of examples. In the first example, we consider all State of the Union addresses given by U.S. presidents. From more than 200 documents, we learn a low-dimensional set of topics characteristic of these documents, where each topic is a probability distribution over the words in the vocabulary, and each document is characterized by the probability that certain topics are present. Collectively, the topics associated with a given document characterize the observed words. Since the number of topics is much smaller than the number of words in the vocabulary, and each document is characterized by a learned probability distribution over these topics, significant compression is realized. In addition, each document is summarized succinctly in terms of a small collection of relevant topics. In Figure 1 we depict the strength with which one inferred topic (banking) is manifested over the lifetime of the United States, based on the State of the Union documents. Note that several important periods in American history are revealed by this analysis.
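A minimal sketch of this kind of topic analysis is shown below, using scikit-learn’s latent Dirichlet allocation as a generic stand-in for the model described in the article; the `addresses` and `years` inputs are hypothetical placeholders for the State of the Union texts, and the topic count is an arbitrary choice.

```python
# Sketch of a topic model: documents become distributions over a small set of topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

addresses = ["...full text of each address..."]   # hypothetical list of documents
years = [1790]                                    # matching list of years

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
counts = vectorizer.fit_transform(addresses)      # document-by-word count matrix

lda = LatentDirichletAllocation(n_components=25, random_state=0)
doc_topics = lda.fit_transform(counts)            # per-document topic proportions

# Each topic is a distribution over the vocabulary; show its most probable words.
vocab = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[-8:][::-1]]
    print(f"Topic {t}: {' '.join(top_words)}")

# Track the strength of one topic over time, as in Figure 1.
banking_topic = 0                                 # index chosen after inspecting topics
for year, proportions in zip(years, doc_topics):
    print(year, proportions[banking_topic])
```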

Figure 1. Depiction of the inferred importance of a topic connected with banking, as a function of year, using presidential State of the Union addresses. The larger the value, the more relevant this topic in that year. Several important events relevant to banking in the United States are highlighted.

In another example, as a demonstration of the ability of such mathematics to scale to massive data, we have applied the topic model to analyze every article on the English Wikipedia website, from which we infer many important topics (each topic is characterized by a small subset of words that occur frequently). Each Wikipedia page is then summarized in terms of a concise set of topics. In Figure 2 we show six example topics, from the hundreds that were uncovered, each characterized by a set of important (highly probable) words. The Wikipedia data are of massive scale, but by using ONR-developed statistical methods, this collection can be analyzed in hours on a personal computer.
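The sketch below suggests how such a model can be fit to a corpus of that scale on a single machine, by updating the topics from a stream of document batches rather than loading everything at once; the batch reader, feature hashing, and topic count are illustrative assumptions rather than the article’s specific method.

```python
# Sketch of streaming (online) topic modeling for a very large document collection.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = HashingVectorizer(n_features=2**15, norm=None,
                               alternate_sign=False, stop_words="english")
lda = LatentDirichletAllocation(n_components=100, learning_method="online",
                                batch_size=4096, random_state=0)

def stream_document_batches():
    """Hypothetical reader that walks a large corpus (e.g., a Wikipedia dump) in batches."""
    yield ["example article text one", "example article text two"]

for batch in stream_document_batches():
    counts = vectorizer.transform(batch)          # word counts, no global vocabulary pass
    lda.partial_fit(counts)                       # incrementally update the topics
```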

Figure 2. Six example topics inferred from the English Wikipedia database. Each topic is characterized by a relatively small set of probable words, as depicted here.

ONR-sponsored mathematics is transforming the way data are measured and analyzed, leveraging the power of modern computing platforms. These developments are critical for enabling Navy decision makers to extract actionable insights from massive and disparate information sources. In addition, this research has many dual uses: the mathematics is also applied in the compression methods employed in wireless phones and the Internet, as well as in the analysis of the complex data collected in many medical centers.

About the Author:

Dr. Carin is the vice provost for research at Duke University. He has been funded by the Office of Naval Research for almost 20 years to develop applied statistics and machine learning tools for Navy applications.