Cambridge, Massachusetts – Scientists from the Massachusetts Institute of Technology (MIT) have designed a new system called the Data Science Machine that aims to search for patterns in databases faster and more efficiently than human beings, using at least two types of algorithms.

The new system was created by Max Kanter, whose master's thesis in computer science is the foundation for the Data Science Machine, alongside Kalyan Veeramachaneni, his thesis advisor and a research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).

MIT researchers aim to take the human element out of big-data analysis, with a new system that not only searches for patterns but designs the feature set, too. Credit: Archive.datacenterdynamics.com

The prototype acts as a natural complement to human intelligence, and it finished ahead of more than 600 human teams in finding predictive patterns buried within unfamiliar data sets. It reduces the reliance on human intuition and completes the search in a fraction of the time humans need, raising the prospect that much of the human work in big-data analysis could be automated. The system not only searches for patterns, it also designs the feature set.

“We view the Data Science Machine as a natural complement to human intelligence,” said Max Kanter, one of the researchers who designed the big data system.

Big-data analysis consists of searching for hidden patterns and choosing which features of the data to analyze. The machine is now capable of making those choices itself when it searches for patterns.

“What we observe from our experience solving a number of data science problems for industry is that one of the very critical steps is called feature engineering […] The initial thing you have to do is recognize what variables to bring out from the database or compose, and for that, you need to come up with a lot of ideas,” Veeramachaneni told MIT News.

The tracking process

Suppose, for example, that one table lists items and their costs and a second table lists the items included in each customer order. First, the machine imports costs from the first table into the second. Second, it executes a suite of operations to generate candidate features such as the total cost per order, the average cost per order, the minimum cost per order, and so on. The machine also looks for categorical data and generates further candidate features by dividing existing features across categories.
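The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: the table contents, the `candidate_features` function, and the feature names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: a table of item costs, a table of item categories,
# and a table of order lines that refers to items by name.
items = {"pen": 2.0, "notebook": 5.0, "lamp": 20.0}
item_category = {"pen": "stationery", "notebook": "stationery", "lamp": "home"}
order_lines = [
    {"order_id": 1, "item": "pen"},
    {"order_id": 1, "item": "notebook"},
    {"order_id": 2, "item": "lamp"},
    {"order_id": 2, "item": "pen"},
    {"order_id": 2, "item": "notebook"},
]

# The "suite of operations" applied to every group of costs (an assumption).
AGGREGATES = {"total": sum, "average": mean, "minimum": min, "maximum": max}


def candidate_features(order_lines, items, item_category):
    """Build one row of candidate features per order."""
    costs = defaultdict(list)
    by_category = defaultdict(lambda: defaultdict(list))

    # Step 1: import each item's cost from the first table into the second.
    for line in order_lines:
        cost = items[line["item"]]
        costs[line["order_id"]].append(cost)
        category = item_category[line["item"]]
        by_category[line["order_id"]][category].append(cost)

    features = {}
    for order_id, values in costs.items():
        # Step 2: aggregate the imported costs per order.
        row = {f"{name}_cost": fn(values) for name, fn in AGGREGATES.items()}
        # Step 3: divide an existing feature (total cost) across categories.
        for category, category_costs in by_category[order_id].items():
            row[f"total_cost_{category}"] = sum(category_costs)
        features[order_id] = row
    return features


features = candidate_features(order_lines, items, item_category)
```

Each entry of `features` is a row of candidate features for one order. A later selection step, not shown here, would decide which of these candidates are actually useful for prediction.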

Testing

To test the new Data Science Machine, the MIT researchers entered it against human teams in three data science competitions. One of these competitions aimed to find predictive patterns in unfamiliar data sets.

In two of the three competitions, the predictions made by the Data Science Machine were 94 percent and 96 percent as accurate as the winning submissions. In the third, the figure was a more modest 87 percent. But while the human teams labored over their prediction algorithms for months, the Data Science Machine took between two and 12 hours to produce each of its entries.

The speed of the search was one of the most surprising results. Patterns that human teams took months to uncover were revealed by the Data Science Machine in 12 hours or less, meaning that the new system completed its work at “inhuman” speed. Of the 906 teams participating in the three competitions, the researchers’ Data Science Machine finished ahead of 615.

Furthermore, the system relies on at least two algorithms to make predictions. Veeramachaneni applies machine-learning techniques to practical problems in big-data analysis, such as determining the power-generation capacity of wind-farm sites or predicting which students are at risk of dropping out of online courses.

Moreover, the two scientists used some tricks to manufacture candidate features for data analysis. Databases normally store different types of data in different tables, indicating the correlations between them using numerical identifiers. The new system tracks these correlations, using them as a cue for feature construction.
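As a rough illustration of how shared identifiers can point to relationships between tables, the sketch below scans a hypothetical schema for identifier columns that appear in more than one table. The schema, the `find_links` function, and the `_id` naming convention are assumptions made for this example. The article does not describe how the Data Science Machine actually detects these relationships, and a real database would usually declare them explicitly.

```python
# Hypothetical schema: table name -> list of column names.
schema = {
    "items": ["item_id", "cost", "category"],
    "order_lines": ["order_id", "item_id", "quantity"],
    "orders": ["order_id", "customer_id"],
}


def find_links(schema, suffix="_id"):
    """Return (table_a, table_b, column) for identifier columns shared by two tables."""
    links = []
    tables = sorted(schema)
    for i, left in enumerate(tables):
        for right in tables[i + 1:]:
            common = set(schema[left]) & set(schema[right])
            # Assumption: identifier columns follow an "*_id" naming convention.
            for column in sorted(c for c in common if c.endswith(suffix)):
                links.append((left, right, column))
    return links


links = find_links(schema)
for left, right, column in links:
    print(f"{left} <-> {right} via {column}")
```

Each discovered link marks a place where values from one table can be brought into another and then aggregated into candidate features.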

“There’s so much data out there to be analyzed. And right now it’s just sitting there not doing anything. So maybe we can come up with a solution that will at least get us started on it, at least get us moving,” Veeramachaneni told MIT News.

The project will be presented next week at the IEEE International Conference on Data Science and Advanced Analytics.

Source: MIT