Moving Objects Dataset: Something-Something v. 2
Your model recognizes certain simple, single-frame gestures like a thumbs-up. But for a truly responsive, accurate system, you want your model to recognize gestures in the context of everyday objects. Is the person pointing to something or wagging their index finger? Is the hand cleaning the display or zooming in and out of an image with two fingers? Given enough examples, your model can learn the difference.
The Something-Something dataset (version 2) is a collection of 220,847 labeled video clips of humans performing pre-defined, basic actions with everyday objects. It is designed to train machine learning models in fine-grained understanding of human hand gestures, such as putting something into something, turning something upside down, and covering something with something.
Samples from the Something-Something dataset:
Putting something on a surface
Moving something up
Covering something with something
Pushing something from left to right
Moving something down
Pushing something from right to left
Taking one of many similar things on the table
Turning something upside down
Tearing something into two pieces
Putting something into something
Putting something next to something
Poking something so lightly that it doesn't or almost doesn't move
| Total number of videos | 220,847 |
| Test set (without labels) | 27,157 |
The dataset was created with the help of more than 1,300 unique crowd actors.
Developers like you have successfully created classification models based on the training set and found that they perform well on the validation set. Running their models on the test set, they have achieved accuracy scores of up to 91 percent.
The video data is provided as one large TGZ archive, split into parts of at most 1 GB each; the total download size is 19.4 GB. The archive contains WebM files, encoded with the VP9 codec, with a height of 240 px. Files are numbered from 1 to 220847.
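Once all parts are downloaded, they can be concatenated in order and extracted as a single TGZ archive. A minimal sketch, assuming the parts follow a naming pattern like `20bn-something-something-v2-00`, `20bn-something-something-v2-01`, and so on (the actual filenames are listed in the download instructions):

```python
import tarfile
from pathlib import Path

def reassemble_and_extract(parts_dir: str, out_name: str, extract_to: str) -> None:
    """Concatenate the split archive parts in order, then extract the TGZ."""
    # Assumed filename pattern; check the download instructions for the real one.
    parts = sorted(Path(parts_dir).glob("20bn-something-something-v2-*"))
    archive = Path(parts_dir) / out_name
    with open(archive, "wb") as out:
        for part in parts:
            out.write(part.read_bytes())
    # The reassembled file is a gzip-compressed tar archive of WebM files.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(extract_to)
```

On Linux or macOS the same result can be had by piping the concatenated parts directly into `tar`, which avoids writing the intermediate archive to disk.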
For each video in the training and validation sets, there is, where applicable, an object annotation in addition to the video label. For example, for a label like "Putting [something] onto [something]," there is also an annotated version, such as "Putting a cup onto a table." In total, there are 318,572 annotations involving 30,408 unique objects.
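If the labels and annotations are distributed as JSON, a small loader might look like the sketch below. The field names used here (`id`, `template`, `label`) are assumptions for illustration; the actual schema is documented in the download instructions.

```python
import json

def load_annotations(path: str) -> dict[str, str]:
    """Map video id -> object-level annotation, falling back to the
    generic label template when no object annotation is available.

    Assumed JSON schema: a list of records with "id", "template",
    and (optionally) an object-annotated "label".
    """
    with open(path) as f:
        entries = json.load(f)
    labels = {}
    for e in entries:
        # "template" is the generic label, e.g. "Putting [something] onto [something]";
        # "label", when present, is the annotated version, e.g. "Putting a cup onto a table".
        labels[e["id"]] = e.get("label") or e["template"]
    return labels
```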
To reduce label noise, five different crowd actors have verified that the action shown in each video matches the description given. The dataset contains only those videos in which all five crowd actors confirmed the match.
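The unanimity rule above amounts to a simple filter over verification votes. A hypothetical sketch (the vote data structure here is invented for illustration, not part of the released dataset):

```python
def unanimously_verified(votes: dict[str, list[bool]], required: int = 5) -> list[str]:
    """Keep only the video ids for which all `required` crowd actors
    confirmed that the shown action matches the description.

    `votes` maps a video id to that video's list of per-actor verdicts
    (a hypothetical structure, assumed for this sketch).
    """
    return [vid for vid, verdicts in votes.items()
            if len(verdicts) >= required and all(verdicts[:required])]
```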
Something-Something is freely available for research purposes.
Please download ALL files, including the download instructions.
“The ‘something something’ video database for learning and evaluating visual common sense,” Goyal, R. et al., arXiv.org, June 15, 2017.
“On the effectiveness of task granularity for transfer learning,” Mahdisoltani, F. et al., arXiv.org, November 29, 2018.
Qualcomm AI Research
AI is shifting from simply seeing what is happening in front of the camera to understanding it. Data is the effective force behind these deep learning breakthroughs and is integral to the human-level performance of neural networks. Our crowd-acting approach to data collection overcomes the typical limitations of crowdsourcing, resulting in high-quality video data that is densely captioned, human-centric and diverse.
Qualcomm AI Research continues to invest in and support deep-learning research in computer vision. The publication of the Something-Something dataset for use by the AI research community is one of our many initiatives.
Find out more about Qualcomm AI Research.
For any questions or technical support, please contact us at [email protected]
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.