In this demonstration, we will showcase real-time grounded language learning on the humanoid robot Pepper. In particular, the robot learns word-object and word-action mappings from cross-modal data, where simple actions, such as take, put, and push, are shown to the robot by a human tutor. Each visual demonstration is accompanied by a verbal description of the performed actions, such as "I take the box and put it next to the bottle." Learning is realized on the robot using the Google Speech API for speech-to-text and Pepper's camera system for object tracking.
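Since the mapping mechanism itself is not detailed here, the following is a minimal illustrative sketch of one possible cross-situational, co-occurrence-based scheme for associating transcribed words with concurrently perceived objects and actions. The class and method names (CrossModalMapper, observe, best_mapping) are hypothetical and not part of the demonstrated system.

```python
# Minimal sketch (assumed, not the demonstrated system): cross-situational
# learning of word-object and word-action mappings via co-occurrence counts.
from collections import defaultdict


class CrossModalMapper:
    """Associates words from a transcribed utterance with concurrently
    perceived objects and actions, using simple co-occurrence statistics."""

    def __init__(self):
        # counts[word][percept] = number of times word and percept co-occurred
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, utterance, percepts):
        """utterance: transcribed sentence, e.g. "I take the box"
        percepts: labels from vision/action recognition, e.g. {"take", "box"}"""
        for word in utterance.lower().split():
            for percept in percepts:
                self.counts[word][percept] += 1

    def best_mapping(self, word):
        """Return the percept most strongly associated with a given word."""
        candidates = self.counts.get(word)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)


# Example: three tutor demonstrations with accompanying verbal descriptions.
mapper = CrossModalMapper()
mapper.observe("I take the box", {"take", "box"})
mapper.observe("I take the bottle", {"take", "bottle"})
mapper.observe("I push the box", {"push", "box"})
print(mapper.best_mapping("take"))  # -> "take" (action percept)
print(mapper.best_mapping("box"))   # -> "box" (object percept)
```

In such a scheme, repeated exposure across situations disambiguates which word refers to which object or action, since only the correct pairings co-occur consistently.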