Context and objective of the corpus collection
This corpus was recorded by the LIG laboratory (Laboratoire d’Informatique de Grenoble, UMR 5217 CNRS/UGA) as part of the VocADom project funded by the French National Research Agency (Agence Nationale de la Recherche/ANR-16-CE33-0006). The authors would like to thank the participants who agreed to take part in the experiments.
This corpus is composed of audio and home-automation data acquired in a real smart home with French speakers. The recording campaign was conducted within the VocADom project, which aims to design a new smart home system based on audio technology. The system is intended to provide assistance through natural human-machine interaction (voice and tactile commands) and reassurance through the detection of distress situations, allowing the inhabitants to manage their environment at any time, in the most natural way possible, from anywhere in the house.
This data set is intended to be useful for the following (non-exclusive) tasks:
- multi-human localization and tracking
- multi-Human Activity Recognition (HAR)
- smart home context modeling
- multi-channel Voice Activity Detection (VAD)
- multi-channel Automatic Speech Recognition (ASR)
- multi-channel Spoken Language Understanding (SLU)
- multi-channel Speaker Recognition (SR)
- multi-channel speech enhancement
- multi-channel Blind Source Separation (BSS)
- context-aware automatic decision making
If you use this dataset, please cite:
François Portet, Sybille Caffiau, Fabien Ringeval, Michel Vacher, Nicolas Bonnefond, Solange Rossato, Benjamin Lecouteux, Thierry Desot (2019). Context-Aware Voice-based Interaction in Smart Home - VocADom@A4H Corpus Collection and Empirical Assessment of its Usefulness. 17th IEEE International Conference on Pervasive Intelligence and Computing (PICom 2019), Fukuoka, Japan.
The Amiqual4Home smart home of the LIG laboratory
The corpus was acquired in the Amiqual4Home smart home. The apartment used to collect this dataset is laid out over two levels, (a) the ground floor and (b) the first floor:
[Figure: floor plans of the Amiqual4Home apartment, (a) ground floor and (b) first floor]
Amiqual4Home is fully functional and is equipped with sensors (e.g., energy and water consumption, humidity level, temperature) and with actuators able to control lighting, shutters and multimedia diffusion, distributed across the kitchen, the bedroom, the office and the bathroom. Observation instrumentation (cameras, microphones and activity-tracking systems) allows experimenters to supervise the recordings from a control room connected to Amiqual4Home. For the project, the flat was equipped with 16 microphones (4 arrays of 4 microphones each) set into the ceiling. Real-time recording was possible by means of software dedicated to recording all the audio channels simultaneously.
Number of home-automation sensors per room, by value type, and number of microphone arrays:

| Room | Binary sensors | Integer sensors | Real-valued sensors | Categorical sensors | Microphone arrays |
|---|---|---|---|---|---|
| Entrance | 3 | 1 | 2 | 3 | 0 |
| Kitchen | 13 | 21 | 18 | 0 | 1 |
| Living room | 16 | 6 | 8 | 7 | 1 |
| Toilet* | – | – | – | – | – |
| Staircase | 3 | 0 | 0 | 0 | 0 |
| Walkway | 9 | 0 | 1 | 0 | 0 |
| Bathroom | 9 | 6 | 8 | 3 | 1 |
| Office* | – | – | – | – | – |
| Bedroom | 17 | 4 | 6 | 7 | 1 |
| All | 70 | 40 | 43 | 20 | 4 (16 channels) |

*: room not used in the experiment
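The home-automation sensors therefore emit values of four kinds (binary, integer, real-valued and categorical). As a purely illustrative sketch of how such events could be represented for context modeling, and not the schema actually used in the openHAB logs, one might use:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Union

# Illustrative only: the real openHAB log schema is described in the
# project's accompanying material, not here. Sensor values fall into
# the four kinds counted in the table above.
@dataclass
class SensorEvent:
    timestamp: datetime
    room: str                              # e.g. "Kitchen", "Bedroom"
    sensor_id: str                         # hypothetical identifier
    value: Union[bool, int, float, str]    # binary / integer / real / categorical

event = SensorEvent(datetime(2018, 1, 1, 10, 15, 0), "Kitchen",
                    "kitchen_light_switch", True)  # hypothetical sensor id
print(event)
```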
Description of the dataset
The final data set is composed of 122 GB from the 16 audio streams (4 arrays of 4 channels each), 1.4 GB from a single audio stream (worn headset microphone) and about 7 MB of openHAB event logs. Ground-truth annotations (location, activity and transcription) are also provided with the dataset.
Participants’ recordings (durations in hour:minute:second format)

| Participant | Age group (years) | Gender | Duration | Chosen keyword |
|---|---|---|---|---|
| S00 | 20-23 | M | 01:01:58 | vocadom |
| S01 | 20-23 | M | 00:48:21 | vocadom |
| S02 | 20-23 | M | 01:06:10 | hé cirrus |
| S03 | 20-23 | M | 01:09:11 | ulysse |
| S04 | 23-25 | F | 01:03:06 | téraphim |
| S05 | <20 | F | 01:04:56 | allo cirrus |
| S06 | 23-25 | M | 00:55:22 | ulysse |
| S07 | 25-28 | M | 01:03:23 | ichefix |
| S08 | 23-25 | M | 01:12:52 | ulysse |
| S09 | 23-25 | F | 01:17:21 | minouche |
| S10 | 23-25 | F | 01:10:43 | hestia |
| All | 23-25 (mean) | 4F/7M | 11:53:23 | 8 keywords |
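The per-participant durations above sum to the reported total of 11:53:23, which can be checked with a few lines of Python:

```python
# Sum the per-participant durations (hh:mm:ss) listed in the table above.
durations = ["01:01:58", "00:48:21", "01:06:10", "01:09:11", "01:03:06",
             "01:04:56", "00:55:22", "01:03:23", "01:12:52", "01:17:21",
             "01:10:43"]

total = sum(int(h) * 3600 + int(m) * 60 + int(s)
            for h, m, s in (d.split(":") for d in durations))
print(f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}")  # 11:53:23
```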
Data structure
Under the record/ directory, each record is organized as follows:

| Directory | Content | Availability |
|---|---|---|
| S<??>/openhab_log/ | log of the home-automation network | publicly available |
| S<??>/activity/ | annotation of the participants’ location and activity | publicly available |
| S<??>/mic_array/ | microphone array recordings | available upon request |
| S<??>/mic_headset/ | headset microphone recordings | available upon request |
| S<??>/speech_transcript/ | transcription of the participants’ speech | available upon request |
| S<??>/NLU/ | semantic annotation of the voice commands | available upon request |
| S<??>/video/ | video recordings for annotation purposes | restricted |
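As a minimal sketch of how a record could be explored in Python, assuming the array and headset audio are distributed as WAV files and the openHAB logs as plain text (the actual file names and formats are described in the material accompanying the dataset; paths such as record/S00 below are examples only):

```python
from pathlib import Path

import soundfile as sf  # third-party: pip install soundfile

record = Path("record/S00")  # one participant's record (example path)

# List the sub-directories of the record (openhab_log/, activity/, ...).
for sub in sorted(p for p in record.iterdir() if p.is_dir()):
    print(f"{sub.name:<20} {sum(1 for _ in sub.iterdir())} entries")

# Peek at one microphone-array recording, assuming WAV format.
for wav in sorted((record / "mic_array").glob("*.wav"))[:1]:
    audio, rate = sf.read(str(wav))
    print(wav.name, audio.shape, f"{rate} Hz")

# openHAB event logs are plain text; print the first few lines of one log.
for log in sorted((record / "openhab_log").iterdir())[:1]:
    with open(log, encoding="utf-8") as fh:
        for line in list(fh)[:5]:
            print(line.rstrip())
```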
Recording phases
Recording was performed continuously from the moment the participant first entered the smart home to the end of the experiment. Three phases were designed to cover voice-command elicitation, multiple users producing voice commands while enacting a visit scenario, and voice-command production in a noisy acoustic environment. Each phase was preceded and followed by discussions during which the experimenters gave instructions to the participant; these discussions are also included in the data set.
- Phase 1 – Graphics-based instructions to elicit spontaneous voice commands (interaction with the home)
- Phase 2 – Two-inhabitant scenario enacting a visit by a friend (interaction with the home)
- Phase 3 – Voice commands in a noisy domestic environment (reading of voice commands in the home)
Structure of the voice commands
Each voice command had to start with a keyword chosen at the beginning of the experiment. The command could then be composed of an action to perform in a room or on an object, expressed in free word order. For instance, “KEYWORD turn off the light” and “KEYWORD I want the light to be turned off please” would be equivalent.
Participants were free to act on an object and to query the state of an object or property using closed-ended questions. For instance, “KEYWORD is the front door open?” would be allowed, but not “KEYWORD what is the state of the bedroom light?”.
The voice commands do not allow conjunction or disjunction (e.g., “KEYWORD turn off the light and raise the blind” would not be understood), but they do allow acting on groups of objects (e.g., “KEYWORD turn off ALL the lights of the upper floor” would be understood).
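As an illustration of this keyword-prefixed structure (a minimal sketch, not the project’s actual grammar or NLU component), a Python snippet that merely checks for one of the keywords listed above and strips it from the utterance could look like this; everything after the keyword is left to downstream spoken language understanding:

```python
# Keywords taken from the participant table above; the matching logic is
# purely illustrative and is not the parser used in the project.
KEYWORDS = ["vocadom", "hé cirrus", "ulysse", "téraphim", "allo cirrus",
            "ichefix", "minouche", "hestia"]

def split_command(utterance: str):
    """Return (keyword, rest) if the utterance starts with a known keyword,
    otherwise None."""
    text = utterance.strip().lower()
    for kw in KEYWORDS:
        if text.startswith(kw):
            return kw, text[len(kw):].strip()
    return None

# Both phrasings carry the same request once the keyword is stripped.
print(split_command("vocadom turn off the light"))
print(split_command("vocadom I want the light to be turned off please"))
print(split_command("pass me the salt"))  # no keyword -> not a voice command
```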
Annotations
The whole set has been annotated with the following ground truth information:
- room-wise location of all participants;
- activities of all participants;
- temporally segmented transcription of the participants’ speech (full transcription of all participants is ongoing);
- semantic annotation of the transcription for Spoken Language Understanding (an illustrative example is sketched below).
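As a purely hypothetical illustration of what an intent/slot style annotation of a voice command can look like (the field names and label set below are assumptions for illustration; the actual annotation scheme used in the NLU/ directory is described in the accompanying documentation):

```python
# Hypothetical intent/slot representation of one voice command; the keys
# and labels are illustrative only, not the dataset's actual scheme.
annotation = {
    "transcription": "vocadom turn off the light in the bedroom",
    "keyword": "vocadom",
    "intent": "set_device",            # assumed intent label
    "slots": {
        "action": "turn_off",
        "device": "light",
        "location": "bedroom",
    },
}
print(annotation["intent"], annotation["slots"])
```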
How to get access to the dataset?
- Corpus with home-automation data and activity labels only: can be downloaded from the following GitLab repository: https://gricad-gitlab.univ-grenoble-alpes.fr/getalp/vocadoma4h/. The GitLab repository also contains other material to help interpret the data (sensor list, floor plans with sensor positions, participants’ demographics, etc.).
- Audio data of the corpus: can be downloaded from the following website: https://persyval-platform.univ-grenoble-alpes.fr/DS272/detaildataset. You must register to request access to the dataset.