
Following the 'red brick road' to data management

The Yellow and Red Brick Roads from the 1939 Wizard of Oz film.

Instead of science and industry following the traditional 'Yellow Brick Road' of relational databases, there may be another, faster path to the Emerald City of better data management. Image courtesy Warner Bros.

For the last 40 years, the way that large-scale services, such as global banks, and scientific experiments, like the LHC at CERN, have been managing their data has been reminiscent of Lyman Frank Baum’s The Wonderful Wizard of Oz. In the fairytale, Dorothy asks how to get to the Emerald City to see the Wizard of Oz, and is simply told that, “all you do is follow the Yellow Brick Road.” Any time she strays from the Yellow Brick Road, Dorothy and her friends encounter serious danger and eventually return to follow the safer road.

Relational databases are the Yellow Brick Road of managing large structured data globally. They are the most popular form of database scheme and have been commonly used since the 1970s. While other types of databases have been built, none have been as effective. 

In relational databases, data is organized in the form of related tables: each table can have many records, and these records can have many data fields. This data can be accessed and added without having to reorganize the tables. The software interface used to build and access data structures within relational databases is Structured Query Language (SQL). It’s the most widely known and respected query language used today and the closest thing to a standard in the database world.
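As a small illustration of related tables, the sketch below uses Python's built-in sqlite3 module. The table names, column names, and values are invented for this example and do not come from any system mentioned in this article.

```python
import sqlite3

# In-memory database with two related tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE experiments (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE runs (
        id            INTEGER PRIMARY KEY,
        experiment_id INTEGER NOT NULL REFERENCES experiments(id),
        events        INTEGER NOT NULL
    );
""")

conn.executemany(
    "INSERT INTO experiments (id, name) VALUES (?, ?)",
    [(1, "experiment-a"), (2, "experiment-b")],
)
conn.executemany(
    "INSERT INTO runs (experiment_id, events) VALUES (?, ?)",
    [(1, 100), (1, 250), (2, 75)],
)

# A new record is added later without reorganizing either table.
conn.execute("INSERT INTO runs (experiment_id, events) VALUES (?, ?)", (2, 25))

# SQL relates the two tables through the shared experiment id.
rows = conn.execute("""
    SELECT e.name, SUM(r.events)
    FROM experiments AS e
    JOIN runs AS r ON r.experiment_id = e.id
    GROUP BY e.name
    ORDER BY e.name
""").fetchall()
print(rows)
```

The query returns one total per experiment, computed by joining each run to the experiment it belongs to.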

Up until now, for organizations, this has been a happy marriage of data storage and access. But, there are growing doubts that relational databases can handle the ‘data deluge’ experienced by the likes of growing web companies and the transition into eScience.

In the 1939 film of The Wizard of Oz, a red brick road is intertwined with the yellow one. Similarly, a new type of database might soon offer a different path: NoSQL, or Not-Only-SQL, a term that took on its current meaning around 2009, promises a faster and more scalable database architecture, at least for some cases. It comes in many different implementations, including Cassandra, MongoDB, and CouchDB. Plus, NoSQL query languages are being developed that are easier to learn than SQL.

Big science and web giants such as Google are looking at NoSQL as the next step in the evolution of database models. Its arrival could shake up the market and replace existing technology within a year – or its arrival might be entirely for nothing. “No one knows yet if it will be a disruptive technology for us,” said Tony Cass, leader of the database services group at CERN.

Partitions, paradoxes, and particles

The theory that underpins all distributed systems is known as the CAP (Consistency, Availability and Partition Tolerance) theorem, proposed by computer scientist Eric A. Brewer in 2000.

CAP names three properties that a distributed system would ideally have: all nodes see the same data at the same time (consistency); every data request receives a reply, whether it succeeded or failed (availability); and the system continues to operate in spite of message loss between nodes (partition tolerance).

Brewer’s crucial point is that it’s impossible to meet all three criteria simultaneously; at most, two of the three can be met. (A loose analogy is Heisenberg’s famous uncertainty principle in physics, which states that an observer cannot know both a particle’s position and its momentum with arbitrary precision at the same time.) Because message loss cannot be ruled out in a system spread across many machines, partition tolerance is effectively mandatory, so in practice the trade-off is between consistency and availability.
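The trade-off can be sketched in a few lines of Python. This is a toy model, not a real replication protocol: the Replica and Pair classes and the "CP" and "AP" mode labels are made up for illustration. During a partition, the pair either refuses writes to stay consistent or accepts them and lets the copies diverge.

```python
class Replica:
    """One copy of a single stored value."""
    def __init__(self):
        self.value = None


class Pair:
    """Two replicas of one value; mode is "CP" or "AP" (hypothetical labels)."""
    def __init__(self, mode):
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True when the replicas cannot talk

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write: copies stay identical, request fails.
                return False
            # Accept the write locally: request succeeds, copies diverge.
            replica.value = value
            return True
        # No partition: update both copies together.
        replica.value = value
        other.value = value
        return True


def run(mode):
    pair = Pair(mode)
    pair.write(pair.a, "v1")
    pair.partitioned = True            # the network link is lost
    accepted = pair.write(pair.a, "v2")
    consistent = pair.a.value == pair.b.value
    return accepted, consistent


cp_result = run("CP")
ap_result = run("AP")
print(cp_result, ap_result)
```

In "CP" mode the write during the partition is rejected and both copies still agree; in "AP" mode the write is accepted and the copies disagree until they can be reconciled.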

Is one database better than the other?

Different data management approaches suit different goals; one decision that companies and organizations face is whether to make their data services strictly consistent.

Consider a multinational bank with millions of customers: when a customer completes a transaction on their account, all their account information located in databases around the world should be updated instantly. This is typically part of the ACID concept (atomicity, consistency, isolation, durability), which lists the properties that guarantee a database transaction is completed reliably, ensuring banking transactions run smoothly.
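A minimal sketch of the atomicity part of ACID, again using Python's built-in sqlite3 module. The accounts table, the names, and the balances are invented for the example; a real banking system would be far more involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit is undone too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw, rolled back
print(ok, failed, balances(conn))
```

The second transfer credits one account before the debit fails, yet the final balances show no trace of it, because the whole transaction is rolled back.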

The other choice is to sacrifice this concept to ensure data services perform quickly for users. For example, Google’s search engine produces results in a fraction of a second. Prioritizing fast responses comes at the cost of consistency when a system handles thousands or even millions of simultaneous search queries while data is being written to nodes on the network, especially if the nodes are spread out between cities and countries.

This concept of prioritizing availability or performance is known as BASE, or ‘basically available, soft state, eventually consistent’, and is typical of NoSQL architectures.
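The 'eventually consistent' part can be illustrated with a toy store, sketched below under simplifying assumptions: a single writer, an in-process queue standing in for the network, and version numbers to discard stale updates. The class and method names are hypothetical and not taken from any NoSQL product.

```python
from collections import deque


class EventuallyConsistentStore:
    """Writes land on replica 0 at once and reach other replicas later."""

    def __init__(self, n_replicas):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.pending = deque()  # queued (replica_index, key, version, value)
        self.clock = 0          # simple version counter

    def write(self, key, value):
        # Acknowledge immediately after updating one replica.
        self.clock += 1
        self.replicas[0][key] = (self.clock, value)
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, self.clock, value))

    def read(self, replica_index, key):
        entry = self.replicas[replica_index].get(key)
        return None if entry is None else entry[1]

    def sync(self):
        # Deliver queued updates; keep only the newest version per key.
        while self.pending:
            i, key, version, value = self.pending.popleft()
            current = self.replicas[i].get(key)
            if current is None or current[0] < version:
                self.replicas[i][key] = (version, value)


store = EventuallyConsistentStore(3)
store.write("greeting", "hello")
stale = store.read(2, "greeting")   # replica 2 has not seen the write yet
store.sync()
fresh = store.read(2, "greeting")   # after propagation the replicas agree
print(stale, fresh)
```

Reads are always answered, but a read from a lagging replica can return old data until the update has propagated.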

A new mindset

Night shots of the Autobahn Road.

The Materials Project has taken the NoSQL road to managing their data. Image courtesy Michael Faes.

One successful implementation of a NoSQL database is in the Materials Project, a new scientific tool that uses the open source MongoDB framework to be the ‘Google’ of material properties, according to its creators at the Lawrence Berkeley National Laboratory in the USA. The project will give scientists a resource for quickly developing new, clean energy technologies, and will let researchers query and analyze material properties for purposes such as developing new batteries.

“Since we don't always know what properties we need ahead of time, it becomes useful to have a flexible schema that allows us to add properties to objects. In this case, objects represent materials. It also allows us easily to attach certain properties to a set of materials only where they make sense, for example only certain materials will have electrochemical properties,” said Shreyas Cholia, a computer engineer at the US National Energy Research Scientific Computing Center (NERSC) involved in the project.

“We use a dynamic JSON based query language [JSON is a text-based open standard, derived from JavaScript, designed for human-readable data interchange], that is significantly easier to use than SQL. This allows us to construct queries and think of results in terms of objects rather than in terms of relational calculus.

“This is a much cleaner way to interface with the database, rather than dealing with complex joins [a method for combining fields from two tables by using values common to each] and relationships, you construct a query object with a list of properties or ranges that you are interested in. The MongoDB query language is extremely flexible and powerful, while being more programmer friendly,” said Cholia.
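To give a flavor of what a query object looks like, here is a sketch in plain Python that evaluates a small subset of MongoDB-style query documents against in-memory records. The operators $gte, $lte and $exists are real MongoDB operators, but the matches function, the field names and the material records are invented for illustration; this is not the Materials Project schema or MongoDB's own implementation, and it assumes that any dictionary used as a condition contains only operators.

```python
MISSING = object()  # marks a field that is absent from a document


def lookup(doc, dotted_path):
    """Follow a dotted path such as 'electrochemistry.voltage'."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return MISSING
        current = current[part]
    return current


def matches(doc, query):
    """Return True if doc satisfies every condition in the query object."""
    for path, condition in query.items():
        actual = lookup(doc, path)
        if isinstance(condition, dict):
            for op, expected in condition.items():
                if op == "$exists":
                    if (actual is not MISSING) != expected:
                        return False
                elif op == "$gte":
                    if actual is MISSING or not actual >= expected:
                        return False
                elif op == "$lte":
                    if actual is MISSING or not actual <= expected:
                        return False
                else:
                    raise ValueError(f"unsupported operator: {op}")
        elif actual is MISSING or actual != condition:
            return False
    return True


# Flexible schema: only some materials carry electrochemical properties.
materials = [
    {"name": "material-A", "band_gap": 1.2},
    {"name": "material-B", "band_gap": 3.1,
     "electrochemistry": {"voltage": 3.4}},
    {"name": "material-C", "band_gap": 0.4,
     "electrochemistry": {"voltage": 2.1}},
]

# The query is itself an object: a list of properties and ranges of interest.
query = {
    "band_gap": {"$gte": 1.0, "$lte": 4.0},
    "electrochemistry.voltage": {"$exists": True},
}

hits = [m["name"] for m in materials if matches(m, query)]
print(hits)
```

No join is needed: the optional electrochemical data lives inside the material record itself, and the query simply names the nested field.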

Not so fast

Two men standing in the middle of the CERN computing centre.

'Big science' projects like CERN are experimenting with non-relational database models as a potential solution to analyzing their ever increasing data more quickly and efficiently. Image courtesy CERN.

While new projects, such as the Materials Project, can easily consider new database models, what about the existing large-scale science projects?

According to Tony Cass, relational databases have contributed to the success of CERN today. “We have used the Oracle relational database for 30 years. Most people would probably expect this for administrative applications, but Oracle was introduced at first to support LEP [Large Electron–Positron Collider] construction and operation. Today, if Oracle doesn’t work, the LHC accelerator doesn’t work. Oracle databases also support critical elements for the LHC experiments.

“These databases, though, have been highly optimised to deliver fast performance for applications that were designed five or more years ago, and it takes time and expertise to adapt the databases for new queries. In contrast, creation of NoSQL solutions for a novel application is often very rapid,” said Cass.

At CERN, the database group is participating in some small-scale tests of NoSQL solutions with three of the four detectors on the LHC, but larger-scale tests need to be done, Cass said.

Cass said he thinks that the popularity of NoSQL is in part based on the fast turnaround for application development and not simply the technical advantages that are essential to support large scale websites, such as Facebook.

“There is nothing wrong with this,” he said, “but you have to be careful when people contrast the performance of NoSQL solutions against relational implementations. Often, people are comparing optimized, small-scale NoSQL solutions against performance of the relational approach on an Oracle platform that has been tuned to deliver high-scale performance in another area.”

“It’s important to avoid picking the wrong technology based on small scale experience. Adding an index to a relational database can be painful, but reconfiguring a 150 terabyte NoSQL database is not likely to be much easier. No one as yet has done a comparison of use cases for large scale [petabytes of] data at CERN or science in general. This is true for applications that manage high volumes of data transfer. For example, we manage 3.5 trillion rows of LHC data in the database, which can be accessed via our Oracle system. Will a NoSQL solution be faster? No one knows what happens at these limits,” said Cass.

Oracle, an industry partner of CERN, is also developing a NoSQL system for companies with large scale web applications that need to handle large volumes of simultaneous reads and writes. “NoSQL offers a new mentality and has already achieved real world success stories. It powers giants like Facebook and LinkedIn,” said Charles Lamb, technical consultant at Oracle.

“Our system is not relational and it extends the NoSQL model in some areas. Data is stored as a key-value pair [a unique key, such as an attribute name, stored together with its associated value], so we don’t use a table-based system,” said Lamb.

“Also, NoSQL databases typically only support single record operations and do not support transactions. We support single and multiple record operations. The main challenge we face is to ensure we can scale up to hundreds of thousands of compute nodes all operating concurrently,” said Lamb. 
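The difference between single and multiple record operations can be sketched with a tiny in-memory key-value store. This is a hypothetical illustration and not Oracle's API: the KeyValueStore class, its method names and the expect_absent parameter are all made up for the example.

```python
import threading


class KeyValueStore:
    """A toy key-value store with single and all-or-nothing multi-key writes."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        # Single record operation.
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def put_many(self, items, expect_absent=()):
        """Write every item, or none if any key in expect_absent exists."""
        with self._lock:
            if any(key in self._data for key in expect_absent):
                return False
            self._data.update(items)
            return True


store = KeyValueStore()
store.put("user:1:name", "Dorothy")

# Multiple record operation: both keys are written together.
first = store.put_many(
    {"user:2:name": "Toto", "user:2:home": "Kansas"},
    expect_absent=["user:2:name"],
)

# This one is rejected as a whole, so "user:3:name" is never written.
second = store.put_many(
    {"user:2:name": "Scarecrow", "user:3:name": "Tin Man"},
    expect_absent=["user:2:name"],
)
print(first, second, store.get("user:3:name"))
```

A single lock is enough for one process; coordinating the same guarantee across many compute nodes is the hard scaling problem Lamb describes.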

If these database issues are still not clear, this amusing animation should explain some of the main arguments. Be warned, there are copious amounts of swearing. Click on the image to watch the video. Image courtesy HighScalability.com.

Many roads

There are also new database systems on the market that do not fit neatly into the relational or NoSQL categories. Daniel Abadi, of Yale University, is chief scientist of Hadapt, a commercial company that recently raised $9.5 million in capital from Bessemer and NorWest Venture Partners. It is building a new scalable analytical database based on HadoopDB, a research project carried out at Yale University.

It uses a distributed framework called Hadoop, which is used by Amazon, eBay, Facebook, and Google to create Web indexes, track user clicks, and make recommendations to customers. Hadapt combines Hadoop and its MapReduce processing model with parallel database techniques that can be found in modern scalable relational database systems.
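The MapReduce idea behind Hadoop can be shown with a single-process sketch in Python. It counts clicks per page from a made-up click log; the function names and the data are illustrative only, and a real Hadoop job would spread the map and reduce steps across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (page, 1) pair for each click record."""
    _user, page = record
    yield page, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical click log: (user, page) pairs.
clicks = [
    ("u1", "/home"),
    ("u2", "/home"),
    ("u1", "/search"),
    ("u3", "/home"),
]

pairs = [pair for record in clicks for pair in map_phase(record)]
counts = dict(reduce_phase(key, values)
              for key, values in shuffle(pairs).items())
print(counts)
```

Because each map call looks at one record and each reduce call looks at one key, both phases can run in parallel on separate nodes, which is where the scalability comes from.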

“NoSQL databases are useful for large scale individual transaction models [a sequence of queries]; these are read, write, update, and delete tasks, but not for large-scale aggregations or analysis of data. Relational databases and Hadoop are still the best solutions for data analysis,” Abadi said.

“However, relational databases and Hadoop were designed for different workloads and therefore have different sets of strengths and weaknesses. Our research into the HadoopDB project has shown that it’s possible to combine the scalability, job complexity, fault tolerance, and ability to process unstructured data of Hadoop with the high performance of relational database systems on structured data.” While Hadapt is pursuing commercial uses, the company is also looking at scientific applications, such as sequence alignment in bioinformatics research.

Meanwhile, the data to be managed continues to increase in volume and complexity. For the growing number of organizations and scientists that need solutions for data-intensive research, the path to take is not always clear. At least those working at the forefront can see an end to the confusion.

“In a year’s time we'll see fewer polemics. We'll see a growing realisation of what is appropriate, where — including a better understanding of the different NoSQL implementations,” said Cass.
