Moving Objects Dataset: Something-Something v2

Your model recognizes certain simple, single-frame gestures like a thumbs-up. But for a truly responsive, accurate system, you want your model to recognize gestures in the context of everyday objects. Is the person pointing to something or wagging their index finger? Is the hand cleaning the display or zooming in and out of an image with two fingers? Given enough examples, your model can learn the difference.

The Something-Something dataset (version 2) is a collection of 220,847 labeled video clips of humans performing pre-defined, basic actions with everyday objects. It is designed to train machine learning models in fine-grained understanding of human hand actions such as "putting something into something," "turning something upside down," and "covering something with something."

Sample classes

Putting something on a surface
Moving something up
Covering something with something
Pushing something from left to right
Moving something down
Pushing something from right to left
Uncovering something
Taking one of many similar things on the table
Turning something upside down
Tearing something into two pieces
Putting something into something
Squeezing something
Throwing something
Putting something next to something
Poking something so lightly that it doesn't or almost doesn't move

Dataset details

Total number of videos: 220,847
Training set: 168,913
Validation set: 24,777
Test set (without labels): 27,157
Number of labels: 174
Video height: 240px
Frame rate: 12 fps
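After downloading, the split sizes above can be sanity-checked by counting the entries in the accompanying metadata files. The following is a minimal sketch; the file paths and the assumption that each split's metadata is a JSON list with one entry per clip are illustrative and may differ from the actual release layout.

```python
import json

def summarize_splits(train_path, val_path, test_path):
    """Count clips per split from the dataset's metadata JSON files.

    Assumes each file holds a JSON list with one entry per clip;
    adjust the paths and schema to match the files you downloaded.
    """
    counts = {}
    for name, path in [("train", train_path),
                       ("validation", val_path),
                       ("test", test_path)]:
        with open(path, encoding="utf-8") as f:
            counts[name] = len(json.load(f))
    return counts
```

The returned dictionary should match the figures listed above (e.g. 168,913 training clips) if the download is complete.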

The dataset was created with the help of more than 1,300 unique crowd actors.

Developers like you have successfully trained classification models on the training set and found that they generalize well to the validation set. On the test set, such models can achieve scores of up to 91 percent.

The video data is provided as one large TGZ archive, split into parts of at most 1 GB each; the total download size is 19.4 GB. The archive contains WebM files encoded with the VP9 codec, each 240px in height. Files are numbered from 1 to 220847.
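Before extraction, the downloaded parts have to be concatenated back into the single TGZ archive. A minimal Python sketch of that step, assuming the parts are passed in their numbered order (the exact part file names depend on the download):

```python
import shutil
import tarfile
from pathlib import Path

def reassemble_and_extract(parts, dest_dir):
    """Concatenate numbered archive parts into one TGZ and extract it.

    `parts` must be an ordered list of the split-file paths; the
    reassembled archive is written into `dest_dir` before extraction.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / "dataset.tgz"
    with open(archive, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # byte-for-byte concatenation
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return archive
```

The same reassembly can be done on the command line with `cat` piped into `tar`; the Python version is shown here only so the step is explicit and portable.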

For each video in the training and validation sets, there is an object annotation in addition to the video label, where applicable. For example, for a label like "Putting [something] onto [something]," there is also an annotated version, such as "Putting a cup onto a table." In total, there are 318,572 annotations involving 30,408 unique objects.
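The object annotations fill in the bracketed placeholders of a label template. A small sketch of that substitution, assuming the placeholders appear literally as "[something]" in the template string:

```python
import re

def fill_template(template, objects):
    """Replace each "[something]" placeholder with the next annotated
    object, in order of appearance."""
    slots = iter(objects)
    return re.sub(r"\[something\]", lambda m: next(slots), template)
```

For example, `fill_template("Putting [something] onto [something]", ["a cup", "a table"])` produces "Putting a cup onto a table", mirroring the annotated version described above.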

To reduce label noise, five different crowd actors have verified that the action shown in each video matches the description given. The dataset contains only those videos in which all five crowd actors confirmed the match.
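The unanimity rule above amounts to a simple filter. The `votes` mapping in this sketch is a hypothetical structure used only for illustration; the released dataset already contains only the videos that passed this check.

```python
def keep_unanimous(videos, votes, n_reviewers=5):
    """Keep only videos for which all n_reviewers confirmed that the
    clip matches its description.

    `votes` maps a video id to a list of booleans, one per reviewer
    (a hypothetical structure, not part of the dataset release).
    """
    return [v for v in videos
            if len(votes.get(v, [])) == n_reviewers and all(votes[v])]
```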

Dataset license

Something-Something is freely available for research purposes.

Dataset download

Please download ALL files including the download instructions.

Citations

“The ‘something something’ video database for learning and evaluating visual common sense,” Goyal, R., et al., arXiv.org, June 15, 2017.

“On the effectiveness of task granularity for transfer learning,” Mahdisoltani, F., et al., arXiv.org, November 29, 2018.

Qualcomm AI Research

AI is shifting from simply seeing what is happening in front of the camera to understanding it. Data is the effective force behind these deep learning breakthroughs and is integral to the human-level performance of neural networks. Our crowd-acting approach to data collection overcomes the typical limitations of crowdsourcing, resulting in high-quality video data that is densely captioned, human-centric and diverse.

Qualcomm AI Research continues to invest in and support deep-learning research in computer vision. The publication of the Something-Something dataset for use by the AI research community is one of our many initiatives.

Find out more about Qualcomm AI Research.

For any questions or technical support, please contact us at [email protected]

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.