We have imported 1,216,317 papers from arXiv (https://arxiv.org/) into our database. Each paper record contains a title, an abstract, a list of authors and a category. The category is assigned manually by the paper's author, so we treat it as an accurate label.
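A record with these four fields could be represented as follows. This is a minimal sketch: the `PaperRecord` class, its field names, the `is_complete` helper and the example values are assumptions made for illustration, not the actual database schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PaperRecord:
    """One paper's metadata (hypothetical schema, not the real database layout)."""
    title: str
    abstract: str
    authors: List[str]
    category: str  # author-assigned arXiv category, e.g. "cs.LG"


def is_complete(record: PaperRecord) -> bool:
    """Return True when all four fields are non-empty."""
    return bool(
        record.title.strip()
        and record.abstract.strip()
        and record.authors
        and record.category.strip()
    )


# Placeholder values for illustration only; this is not a real paper.
example = PaperRecord(
    title="An example title",
    abstract="An example abstract.",
    authors=["A. Author", "B. Author"],
    category="cs.LG",
)
```

A check such as `is_complete` is one way to filter out records with missing fields before they are used as labeled data.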
We use these papers as a standard corpus to test the precision of our document vectorization and classification algorithms. Document vectorization is the foundation for locating a paper, and document classification is used to color papers whose fields are unknown, so using this dataset properly should make our paper map more expressive and more sensible.
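One way to use the author-assigned categories as ground truth is to hold out some labeled papers, predict their categories from their vectors, and measure the fraction of correct predictions. The sketch below is only an illustration. It swaps in a plain bag-of-words vectorizer and a nearest-centroid classifier with cosine similarity as stand-ins, because the actual algorithms are not specified here. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts (a stand-in for the real vectorizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labeled):
    """Sum the vectors of each category; labeled is [(abstract, category), ...]."""
    centroids = defaultdict(Counter)
    for abstract, category in labeled:
        centroids[category].update(vectorize(abstract))
    return centroids


def classify(abstract, centroids):
    """Predict the category whose centroid is most similar to the abstract."""
    vec = vectorize(abstract)
    return max(centroids, key=lambda cat: cosine(vec, centroids[cat]))


def accuracy(test_set, centroids):
    """Fraction of held-out papers whose predicted category matches the label."""
    hits = sum(classify(text, centroids) == label for text, label in test_set)
    return hits / len(test_set)


# Toy data, invented for illustration only.
train = [
    ("galaxy redshift survey of distant quasars", "astro-ph"),
    ("stellar spectra and galaxy clusters", "astro-ph"),
    ("neural network training with gradient descent", "cs.LG"),
    ("deep learning model for image classification", "cs.LG"),
]
held_out = [
    ("quasars in a distant galaxy", "astro-ph"),
    ("gradient descent for a neural model", "cs.LG"),
]

centroids = train_centroids(train)
score = accuracy(held_out, centroids)
```

Because cosine similarity ignores vector length, summing the vectors of a category gives the same ranking as averaging them. The same `classify` call could then be applied to papers whose fields are unknown to choose a color for them.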
arXiv began as a physics archive and soon expanded to include astronomy, mathematics, computer science, nonlinear science, quantitative biology and, most recently, statistics. Mathematicians and scientists regularly upload their papers to it for worldwide access, and sometimes for review before publication in peer-reviewed journals.
The default arXiv license requires only that indexes or tools built on the full text link back to arXiv for downloads. We use only the four pieces of information listed above (title, abstract, authors and category), so we are free to use them.