USPTO-2M: Patent Classification Benchmark Dataset with 2 million patents and 49,900 test patents

Jie Hu, Jianjun Hu
Machine Learning and Evolution Laboratory
Department of Computer Science and Engineering
University of South Carolina
Cite the dataset:
Shaobo Li, Jie Hu, Yuxin Cui, Jianjun Hu, DeepPatent: Convolutional Neural Networks for Patent Classification. Information Processing and Management. 2017


Patent classification is an important task in patent information management and patent knowledge mining. However, this task is still largely done manually because current machine learning algorithms perform unsatisfactorily. Patent classification is a multi-class classification problem with a large number of samples and 637 categories at the subclass level. A standard dataset for benchmarking patent classification algorithms is beneficial to the research community. A related, smaller benchmark dataset is the CLEF-IP dataset. Its data collections are extracts of the MAREC dataset, containing over 2.6 million patent documents pertaining to 1.3 million patents from the European Patent Office, with content in English, German and French, extended by documents from the WIPO.


Our USPTO-2M dataset contains 10 years of USPTO patent data (2006 to 2015), cleaned and organized into JSON format.
Statistics of the dataset:

Training dataset: 1,950,247 patents; test dataset: 49,900 patents
Number of distinct IPC-R subclasses: 637 in the training dataset, 607 in the test dataset
Average number of labels per patent: 1.32 in the training dataset, 1.93 in the test dataset
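
The statistics above can be recomputed from the records themselves. The following is a minimal sketch, not the authors' script: it assumes each record has already been loaded into a Python dict with a "Subclass_labels" list (the field name used in the sample record on this page), and the function name label_statistics and the toy records are hypothetical.

```python
from collections import Counter


def label_statistics(records):
    """Return (patent count, distinct subclass count, average labels per patent).

    `records` is any iterable of dicts that carry a "Subclass_labels" list.
    """
    subclass_counts = Counter()
    total_labels = 0
    patent_count = 0
    for record in records:
        labels = record["Subclass_labels"]
        subclass_counts.update(labels)   # track how often each subclass occurs
        total_labels += len(labels)
        patent_count += 1
    average = total_labels / patent_count if patent_count else 0.0
    return patent_count, len(subclass_counts), average


# Toy records for illustration only; these are not taken from the dataset.
toy_records = [
    {"Subclass_labels": ["A43B", "A41D", "A43C"]},
    {"Subclass_labels": ["A43B"]},
]
stats = label_statistics(toy_records)
print(stats)
```

Because the function accepts any iterable, the records can be streamed from disk one at a time instead of being held in memory.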

A sample of our data:
{ "Subclass_labels": [ "A43B", "A41D", "A43C" ],
"Abstract": "a decorative and or promotional accessory to be secured to a lace such as a shoe lace includes a molded plastic body having a passage longitudinally extending therethrough from a first opening to a second opening the passage is sized and shaped to receive the lace therethrough and to frictionally secure the body in a desired position along the lace the accessory also includes indicia provided on an exterior surface of the accessory which can be in the form of any desired message name number logo graphic or the like an alternative embodiment of the accessory is disclosed which is to be secured to a cap bill this embodiment includes a slot radially extending to the passage which is sized and shaped to receive the cap brim therein and to resiliently grip the bill and removably secure the accessory in a desired position along the bill",
"Title": "accessory for shoe laces hat brims and the like",
"No": "US08925116" }
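
A record like the one above can be read with a standard JSON parser. The sketch below is illustrative only: it assumes one JSON object per string (the exact file layout inside the archives is not described here), shortens the abstract for brevity, and uses hypothetical helper names (parse_record, multi_hot) plus a four-entry toy vocabulary instead of the full 637 subclasses.

```python
import json

# The sample record from this page, with the abstract shortened for this example.
sample_line = (
    '{"Subclass_labels": ["A43B", "A41D", "A43C"], '
    '"Abstract": "a decorative and or promotional accessory to be secured to a lace", '
    '"Title": "accessory for shoe laces hat brims and the like", '
    '"No": "US08925116"}'
)


def parse_record(line):
    """Parse one JSON record into (patent number, text, subclass labels)."""
    record = json.loads(line)
    text = record["Title"] + " " + record["Abstract"]
    return record["No"], text, record["Subclass_labels"]


def multi_hot(labels, vocabulary):
    """Encode a label list as a 0/1 vector over a fixed subclass vocabulary."""
    index = {subclass: i for i, subclass in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for label in labels:
        vector[index[label]] = 1
    return vector


# Toy vocabulary (assumption): a real one would list every subclass in the training set.
vocabulary = sorted(["A41D", "A43B", "A43C", "G06F"])

patent_no, text, labels = parse_record(sample_line)
target = multi_hot(labels, vocabulary)
print(patent_no, labels, target)
```

A multi-hot target is used here because a patent can carry more than one subclass label, as the sample record does.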


Download the whole dataset in a single zip file (414M).

Download the dataset by year:

USPTO_2006 (150M)
USPTO_2007 (135M)
USPTO_2008 (135M)
USPTO_2009 (143M)
USPTO_2010 (186M)
USPTO_2011 (191M)
USPTO_2012 (216M)
USPTO_2013 (236M)
USPTO_2014 (255M)
USPTO_2015 (45M)