
Getting value from a trillion electron haystack

Image of plot map of billions of particles.

After querying a dataset of 114,875,956,837 particles for those with energy values less than 1.5, FastQuery identified 57,740,614 particles, which are mapped on this plot. Image courtesy Oliver Rubel, Berkeley Lab.

Modern research tools like high-performance computers and particle colliders are generating so much data, so quickly, that many scientists fear they will not be able to keep up with the deluge. Now Berkeley researchers have designed strategies for extracting interesting data from massive scientific datasets and, for the first time, queried a 32 terabyte, trillion-particle dataset in three seconds.

“These instruments are capable of answering some of our most fundamental scientific questions, but it is all for nothing if we can’t get a handle on the data and make sense of it,” said Surendra Byna of the Lawrence Berkeley National Laboratory’s (Berkeley Lab’s) Computational Research Division (CRD).

That’s why researchers from Berkeley Lab’s CRD, the University of California, San Diego (UCSD), Los Alamos National Laboratory, Tsinghua University, and Brown University teamed up to develop software strategies for storing, mining, and analyzing massive datasets - specifically, for data generated by a state-of-the-art plasma physics code called VPIC.

When the team ran VPIC on the Department of Energy’s National Energy Research Scientific Computing Center’s (NERSC’s) Cray XE6 ‘Hopper’ high-performance computer, they generated a 3D dataset of a trillion particles to better understand magnetic reconnection in particles. Magnetic reconnection is a physical process where magnetic topology is rearranged, and magnetic energy is converted into kinetic energy, thermal energy, and particle acceleration.

VPIC simulated the process in thousands of time-steps, periodically writing a 32 terabyte file – which is five times more than the world’s largest library, the US Library of Congress – to disk at specified times. Each time-step was a frame in the bigger simulation.

Every 32 terabyte file was written to disk in about 20 minutes, at a sustained rate of 27 gigabytes per second. By applying an enhanced version of FastQuery, an information query language for large and complex data, they indexed this massive dataset in about 10 minutes, and then queried it in three seconds for interesting features to visualize.
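The write figures above can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check, not part of the team's software. It assumes decimal units (1 terabyte = 1,000 gigabytes) and exactly one trillion particles, neither of which the article states.

```python
# Back-of-the-envelope check of the I/O figures quoted in the article.
# Assumptions (not stated in the article): decimal units, exactly 1e12 particles.
FILE_SIZE_TB = 32
WRITE_RATE_GB_PER_S = 27
PARTICLES = 10**12


def write_time_minutes(size_tb: float, rate_gb_per_s: float) -> float:
    """Minutes needed to write size_tb terabytes at a sustained rate."""
    size_gb = size_tb * 1000
    return size_gb / rate_gb_per_s / 60


minutes = write_time_minutes(FILE_SIZE_TB, WRITE_RATE_GB_PER_S)
bytes_per_particle = FILE_SIZE_TB * 1000**4 / PARTICLES

print(f"write time: {minutes:.1f} minutes")           # write time: 19.8 minutes
print(f"storage per particle: {bytes_per_particle:.0f} bytes")  # 32 bytes
```

Under these assumptions the quoted rate gives just under 20 minutes per file, which agrees with the "about 20 minutes" reported, and each particle occupies roughly 32 bytes on disk.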

Trillions of particles require exascale computing

According to Homa Karimabadi, who leads the space physics group at UCSD, one of the major unsolved mysteries in magnetic reconnection is the conditions and details of how energetic particles are generated. Until recently, the closest that anybody had come to studying this was by looking at 2D simulations.

“To answer these questions we need to take into full account additional effects such as flux-rope interactions and resulting turbulence that occur in 3D simulations,” Karimabadi said. “But, as we add another dimension, the number of particles in our simulations grows from billions to trillions. And it is impossible to pull up a trillion-particle dataset on your computer screen.”

To address these challenges, Karimabadi joined forces with the ExaHDF5 team, a DOE-funded collaboration that is developing high-performance input/output and analysis strategies for future exascale computers.

A scalable storage approach for a successful search

According to Byna, VPIC models magnetic reconnection by breaking down the ‘big picture’ into pieces, each of which is assigned, using a Message Passing Interface (MPI), to a group of processors to compute.

In the original implementation of VPIC, each MPI domain generates a binary file once it finishes processing its piece. One major limitation of this approach is that the number of files generated for large simulations becomes unwieldy. The team’s largest VPIC run contained about 20,000 MPI domains - 20,000 binary files per time-step.

“It takes a really long time to perform a simple Linux search of a 20,000-file directory. Ultimately, these limitations become a bottleneck to scientific analysis and discovery,” Byna said.

By incorporating a high-performance parallel data interface to HDF5, called H5Part, into the VPIC codebase, the team overcame these challenges. This modification creates one shared HDF5 file per time-step, instead of 20,000 independent binary files. Because most visualization tools use HDF5 files, this eliminates the need to re-format the data.
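The article does not describe how H5Part arranges data inside the shared file. One common way to let many domains write into a single array is to give each domain its own contiguous, non-overlapping slice, with slice positions computed from a running total of particle counts. The sketch below illustrates that idea in plain Python, with no MPI or HDF5 involved. The function name, the particle counts, and the list standing in for the shared dataset are all hypothetical.

```python
from itertools import accumulate


def slice_offsets(counts):
    """Return one (start, stop) pair per domain via an exclusive prefix sum."""
    starts = [0] + list(accumulate(counts))[:-1]
    return [(start, start + count) for start, count in zip(starts, counts)]


# Hypothetical particle counts held by four MPI domains.
counts = [3, 5, 0, 2]
offsets = slice_offsets(counts)

# A plain list stands in for the single shared dataset of one time-step.
shared = [None] * sum(counts)
for rank, (start, stop) in enumerate(offsets):
    # Each domain fills only its own slice, so no two writes collide.
    shared[start:stop] = [rank] * (stop - start)

print(offsets)  # [(0, 3), (3, 8), (8, 8), (8, 10)]
print(shared)   # [0, 0, 0, 1, 1, 1, 1, 1, 3, 3]
```

Because the slices never overlap, the domains need no coordination beyond agreeing on the counts, and the result is one file to open rather than one per domain.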

Data mining made easier with FastQuery

Image of visualization of 1 trillion-electron dataset.

This is a visualization of the 1 trillion-electron dataset at timestep 1905. All of the particles with energy > 1.3 are shown in grey, while particles with energy > 1.5 are shown in color. A total of 164,856,597 particles with energy > 1.3 and 423,998 particles with energy > 1.5 appear to be accelerated preferentially along the direction of the mean magnetic field, corresponding to the formation of four jets. Image courtesy Oliver Rubel, Berkeley Lab.

Once the information had been stored, the next challenge was making sense of it. On this front, team members implemented an enhanced version of FastQuery. Using this tool, they indexed the 32 terabyte dataset in about 10 minutes, and queried it in three seconds. This was the first time a trillion-particle dataset had been queried this quickly.

The team accelerated FastQuery’s capabilities by implementing a hierarchical load-balancing strategy. Because FastQuery is built on FastBit indexing technology, researchers can search their data based on an arbitrary range of conditions that are defined by available data values. This means researchers can search a trillion particle dataset and sift out electrons by their energy values.
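To give a feel for how an index can answer a query such as "energy > 1.3" without scanning every particle, here is a toy binned bitmap index in Python. It is only a sketch of the general idea: FastBit uses compressed bitmap indexes and is far more sophisticated, and none of the names below (BinnedBitmapIndex, greater_than, the bin edges, the sample energies) come from FastBit or FastQuery.

```python
import bisect


def set_bits(bitmap):
    """Yield the positions of the 1-bits in an integer bitmap, lowest first."""
    row = 0
    while bitmap:
        if bitmap & 1:
            yield row
        bitmap >>= 1
        row += 1


class BinnedBitmapIndex:
    """Toy index: one uncompressed bitmap (a Python int) per value bin."""

    def __init__(self, values, bin_edges):
        self.values = list(values)
        self.edges = sorted(bin_edges)
        # Bin i holds rows with edges[i-1] <= value < edges[i].
        self.bitmaps = [0] * (len(self.edges) + 1)
        for row, value in enumerate(self.values):
            self.bitmaps[bisect.bisect_right(self.edges, value)] |= 1 << row

    def greater_than(self, threshold):
        """Return the sorted row numbers whose value exceeds threshold."""
        boundary = bisect.bisect_right(self.edges, threshold)
        hits = 0
        # Bins above the boundary bin match entirely: OR their bitmaps.
        for bitmap in self.bitmaps[boundary + 1:]:
            hits |= bitmap
        # The boundary bin straddles the threshold: check its raw values.
        for row in set_bits(self.bitmaps[boundary]):
            if self.values[row] > threshold:
                hits |= 1 << row
        return list(set_bits(hits))


# Hypothetical particle energies, one per row.
energies = [0.2, 1.4, 1.6, 0.9, 1.35, 2.1, 1.5]
index = BinnedBitmapIndex(energies, bin_edges=[0.5, 1.0, 1.5, 2.0])

print(index.greater_than(1.3))  # [1, 2, 4, 5, 6]
print(index.greater_than(1.5))  # [2, 5]
```

Only the rows in the single bin that straddles the threshold are compared against raw values. Every other bin is either taken whole or ignored, which is why a selective query can touch a small fraction of the data.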

This capability also helps with visualization. Because most computer displays contain only a few million pixels, it’s impossible to render a dataset with trillions of particles. Now, researchers can use FastQuery to identify particles of interest to render.

Karimabadi said, “Although our VPIC runs typically generate two types of data — grid and particle — we never did a whole lot with the particle data because it was really hard to extract information from a trillion particle dataset.”

A version of this story first appeared on Berkeley Lab’s Computational Research Division website.
