Sifting Rapidly Through Petabytes of Data

By Roland Piquepaille

Searching within large databases has never been easy. But when it comes to physics, and especially to experiments with particle colliders, the task becomes extremely difficult. You have to look at hundreds of millions of particle collisions to isolate only a few dozen of interest. And when you realize that all these individual records are stored in data files and systems scattered all over the world, it becomes clear that the search process is a tough challenge to crack. But now, a technology known as the Word-Aligned Hybrid (WAH) compression method, developed at Lawrence Berkeley National Laboratory (Berkeley Lab), is dramatically speeding up the search process. For example, it took only 15 minutes to retrieve 80 events recorded in 2001 and hidden like needles in a haystack inside petabytes of data. But read more...

Here is how Berkeley Lab describes the use of the WAH method.

WAH is currently used in a software package called FastBit to compress bitmap indexes. A bitmap index is a method of reducing the response time of queries involving common types of conditions in data objects, such as "state = CA" and "age >= 21." It achieves this by storing certain pre-computed answers as bitmaps. For example, a bitmap index for "state" might have one bitmap for each state in the U.S. Because computers can manipulate bitmaps efficiently, bitmap indices are efficient in searching for interesting records in large datasets.
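
To make the idea of a bitmap index concrete, here is a minimal sketch in Python. It is not FastBit code: the function names (`build_bitmap_index`, `rows_from_bitmap`) and the sample data are hypothetical, and a plain Python integer stands in for each uncompressed bitmap. It only illustrates how pre-computed bitmaps for conditions like "state = CA" and "age >= 21" can be combined with a single bitwise AND.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Return {distinct value: bitmap}; bit i is set when values[i] equals that value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_from_bitmap(bitmap):
    """List the row numbers whose bit is set in the bitmap."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical sample records (one entry per row).
states = ["CA", "NY", "CA", "TX", "NY", "CA"]
ages = [25, 19, 17, 40, 30, 21]

# One bitmap per distinct state, as in the "state" example above.
state_index = build_bitmap_index(states)

# One pre-computed bitmap for the range condition "age >= 21".
adult_bitmap = 0
for row, age in enumerate(ages):
    if age >= 21:
        adult_bitmap |= 1 << row

# Answer "state = CA AND age >= 21" with a single bitwise AND.
matches = rows_from_bitmap(state_index["CA"] & adult_bitmap)
print(matches)
```

With this sample data, only rows 0 and 5 satisfy both conditions. A real index would also compress the bitmaps, which is where WAH comes in.
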
WAH compression makes the bitmap index optimal in terms of computational complexity. A small number of the most efficient indexing schemes have this optimality property. What makes the new technology unique is that WAH-compressed indexes significantly outperform other schemes in tests.
"In tests conducted using actual data from high-energy physics experiments, we confirmed that our FastBit software is an order of magnitude faster than the best-known bitmap indexing schemes on average," according to John Wu, the lead developer of FastBit.

Of course, the key here was to build the compressed indexing system.

A number of specialized compression schemes have been proposed to process compressed indexes efficiently, with the best-known one called the Byte-aligned Bitmap Code (BBC).
The goal of the Berkeley Lab project was to create an indexing system that could be compressed and at the same time offers much faster searches than existing methods. To achieve this goal, the WAH compression scheme was developed. While WAH-compressed indexes are slightly larger than BBC-compressed indexes, the time needed to process a query is less, often much less.
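
For readers curious about what "word-aligned" means in practice, here is a simplified sketch of WAH-style encoding as I understand it from the published descriptions of the scheme, assuming 32-bit words that each carry 31 bits of the bitmap. It is an illustration only, not the FastBit implementation: the names `wah_encode` and `wah_decode` are hypothetical, and the final group is simply zero-padded, whereas a real implementation would keep track of the exact bitmap length.

```python
GROUP_BITS = 31                      # bitmap bits carried by one 32-bit word
FILL_FLAG = 1 << 31                  # top bit set: the word is a "fill" word
COUNT_MASK = (1 << 30) - 1           # low 30 bits of a fill word: number of groups
ALL_ONES = (1 << GROUP_BITS) - 1     # a 31-bit group made only of 1s


def wah_encode(bits):
    """Encode a list of 0/1 values as 32-bit words (simplified WAH-style sketch).

    Literal word: top bit 0, low 31 bits hold one group verbatim.
    Fill word: top bit 1, next bit is the fill value, low 30 bits count
    how many consecutive all-0 or all-1 groups the word represents.
    """
    words = []
    for start in range(0, len(bits), GROUP_BITS):
        group = bits[start:start + GROUP_BITS]
        group = group + [0] * (GROUP_BITS - len(group))   # zero-pad the last group
        value = 0
        for bit in group:
            value = (value << 1) | bit
        if value == 0 or value == ALL_ONES:
            header = FILL_FLAG | ((1 if value else 0) << 30)
            same_fill = bool(words) and (words[-1] & ~COUNT_MASK) == header
            if same_fill and (words[-1] & COUNT_MASK) < COUNT_MASK:
                words[-1] += 1                            # extend the current run
            else:
                words.append(header | 1)                  # start a new run of one group
        else:
            words.append(value)                           # mixed group: store as literal
    return words


def wah_decode(words):
    """Expand the words back into a list of bits (a multiple of 31 long)."""
    bits = []
    for word in words:
        if word & FILL_FLAG:
            fill_bit = (word >> 30) & 1
            bits.extend([fill_bit] * (GROUP_BITS * (word & COUNT_MASK)))
        else:
            bits.extend((word >> shift) & 1 for shift in range(GROUP_BITS - 1, -1, -1))
    return bits


# 62 zeros, a single 1 followed by 30 zeros, then 31 ones: four groups in total.
sample = [0] * 62 + [1] + [0] * 30 + [1] * 31
encoded = wah_encode(sample)
print([hex(word) for word in encoded])
```

Because every unit is a whole machine word, logical operations on two compressed bitmaps can proceed word by word without unpacking individual bytes or bits. That is consistent with the trade-off described above: somewhat larger indexes than BBC, but faster query processing.
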

Now, let's look in more detail at the Grid Collector, the software used to analyze the petabytes of data generated each year by the STAR (Solenoidal Tracker at RHIC) high-energy physics experiment.

First, here is a link to a paper called "The Grid Collector: Using an Event Catalog to Speedup User Analysis in Distributed Environment" (PDF format, 4 pages, 251 KB).

This research will also be presented next June at the International Supercomputer Conference in Heidelberg, Germany, where it has been selected as one of the three best papers.

Here is a link to the abstract of this paper named "Grid Collector: Facilitating Efficient Selective Access from Data Grids." Below is an excerpt.

Since most analysis jobs filter out significant number of events, a considerable amount of time is wasted by reading the unwanted events. The Grid Collector removes this inefficiency by allowing users to specify more precisely what events are of interest and to read only the selected events. This speeds up most analysis jobs. In existing analysis frameworks, the responsibility of bringing files from tertiary storage to disk falls on the users. This forces most of analysis jobs to be performed at centralized computer facilities where commonly used files are kept on disks.

Finally, the researchers filed an application for a U.S. patent, which was granted on December 14, 2004 under the title "Word aligned bitmap compression method, data structure, and apparatus." And here is a link to this patent, number 6,831,575.

Sources: Lawrence Berkeley National Laboratory news release, May 16, 2005; and various websites
