Sunnyvale, California – Yahoo! Inc (NASDAQ: YHOO) has released to researchers and the public what it calls the largest-ever machine learning dataset: a collection of anonymized records of user interactions on Yahoo properties such as Yahoo News, Yahoo Sports and Yahoo Movies, drawn from about 20 million people.

The American company explained that the data is relevant for machine learning research. Previously, data of this kind was hard to access, since it was restricted to specific researchers and large companies.

The company, headquartered in Sunnyvale, California, added that the dataset will let researchers study machine learning problems in greater depth in areas such as search ranking, computational advertising, information retrieval, and core machine learning.

Suju Rajan, director of Research for Personalization Science at Yahoo Labs, said that in recent years tech researchers have been trying to develop new algorithms and methodologies to increase traffic to web pages. The released data is expected to boost research in this area.

“Today, we are proud to announce the public release of the largest-ever machine learning dataset to the research community. The dataset stands at a massive ~110B events (13.5TB uncompressed) of anonymized user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015,” wrote Rajan, who holds a PhD from the University of Texas at Austin, in a blog post on Thursday.
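
To put those figures in perspective, the short Python sketch below derives the average size of one event and the average number of events per user from the numbers in the quote. It is a back-of-the-envelope illustration only: the quoted values are approximate, the function name `scale_summary` is made up for this example, and it assumes "TB" means decimal terabytes (10^12 bytes).

```python
# Approximate figures quoted in the Yahoo Labs announcement.
EVENTS = 110e9          # ~110 billion interaction events
SIZE_BYTES = 13.5e12    # 13.5 TB uncompressed (assumption: decimal TB)
USERS = 20e6            # about 20 million users


def scale_summary(events: float, size_bytes: float, users: float) -> dict:
    """Return rough per-event and per-user averages for a dataset."""
    return {
        "bytes_per_event": size_bytes / events,
        "events_per_user": events / users,
    }


summary = scale_summary(EVENTS, SIZE_BYTES, USERS)
print(f"Average bytes per event: {summary['bytes_per_event']:.0f}")
print(f"Average events per user: {summary['events_per_user']:.0f}")
```

On these rounded inputs, each event averages a little over 120 bytes and each user accounts for roughly 5,500 events over the collection period.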

The user interactions that make up the Yahoo News Feed dataset were taken from the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies and Yahoo Real Estate. The company is also providing demographic information, including age range and gender, for a subset of users. It stressed that individuals will not be exposed, since all of the data is anonymized.

“Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research. The dataset is available as part of the Yahoo Labs Webscope data-sharing program, which is a reference library of scientifically-useful datasets comprising anonymized user data for non-commercial use,” said Rajan.

Notably, Yahoo has enriched the release with extra data, such as the title, summary and key phrases of each news article. The local time and the type of device used to access the content are also included. These properties could make the dataset valuable to researchers, tech companies, journalism institutes, and editors and writers at small news agencies that cannot afford to collect data at this scale.

Officials from the Jacobs School of Engineering at UC San Diego said in a press release published on Business Wire that the newly public data will help the institution advance its research in machine learning, artificial intelligence and big data applications. Meanwhile, Tom Mitchell of the Machine Learning Department at Carnegie Mellon University said that researchers will gain access to realistic data that can be analyzed to understand what types of articles interest different segments of users.

Last year, the company Criteo also released a public machine learning dataset to encourage academic research and the creation of new algorithms for predicting user preferences. Netflix and Amazon are experts in the field: every time a user sees a suggestion to buy an item or watch a movie, it is because the company has developed complex algorithms with the help of machine learning data.

Other services, such as Apple Music, use the same approach to recommend new songs and genres based on listeners' preferences. Tech companies also analyze interaction data to offer location-based services tailored to users' preferences.

Source: Yahoo Labs