Cognitive Perception

Thanks to our technology, intelligent machines are able to process visual and auditory information, assess situations and interact with their environment. Driverless road vehicles and robots working side by side with production workers must be able to perceive their surroundings. Without these perception skills, no collaboration between humans and intelligent systems is possible.

As humans, we often combine different sensory impressions and process them simultaneously to get an immediate and precise picture of our environment. Similarly, if artificially intelligent systems are to assess and predict situations in real time, they must be able to draw on and process a variety of information from different channels.

We have therefore adapted Machine Learning methods specifically to process auditory and visual information. We combine facial recognition with voice recognition, which is based on speech characteristics, to identify individual speakers, for example in TV programs. This type of multimodal indexing, i.e. the combination and analysis of information from different channels and in a variety of formats, makes it more likely that you will find what you are looking for quickly and reliably.
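One common way to combine two channels is late fusion: each recognizer scores the candidate speakers on its own, and the scores are then merged. The sketch below is only an illustration of that general idea, not a description of our system. The function names, the speaker names, the scores and the equal default weighting are all assumptions made for the example.

```python
from typing import Dict


def fuse_scores(face_scores: Dict[str, float],
                voice_scores: Dict[str, float],
                face_weight: float = 0.5) -> Dict[str, float]:
    """Weighted late fusion of per-speaker scores from two modalities.

    Hypothetical helper: scores are assumed to lie in [0, 1], and a
    speaker missing from one modality is given a score of 0.0 there.
    """
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must be in [0, 1]")
    fused = {}
    for name in set(face_scores) | set(voice_scores):
        face = face_scores.get(name, 0.0)
        voice = voice_scores.get(name, 0.0)
        fused[name] = face_weight * face + (1.0 - face_weight) * voice
    return fused


def identify(face_scores: Dict[str, float],
             voice_scores: Dict[str, float],
             face_weight: float = 0.5) -> str:
    """Return the speaker with the highest fused score."""
    fused = fuse_scores(face_scores, voice_scores, face_weight)
    return max(fused, key=fused.get)


# Made-up scores for two candidate speakers in one video segment.
face = {"anna": 0.6, "ben": 0.4}
voice = {"anna": 0.3, "ben": 0.7}
best_match = identify(face, voice)
```

With equal weights the voice evidence tips the decision towards "ben". Raising `face_weight` lets the face channel dominate, which might be preferable when the audio is noisy.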

The systems we have developed for recognizing spoken German are the most accurate in the world. For training, we can draw on our own speech database, which contains more than 1,000 hours of transcribed voice recordings. Our solutions have been used successfully for many years across many sectors, including the media industry, and are continually being refined.

We use Machine Learning, particularly deep neural networks, to recognize objects such as road traffic signs. These Deep Learning methods are particularly successful when the data has a hierarchical structure and a large amount of training data is available.

If sufficient training data is not available for certain scenarios, we favor hybrid Machine Learning methods that allow us to add expert knowledge to the data we do have. We also research other data-efficient learning approaches, for example methods that generate artificial training data.
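As a simple illustration of generating artificial training data, the sketch below derives several variants from a single grayscale image by changing its brightness and shifting it sideways. It is a minimal example under stated assumptions, not the method we use: the image representation (nested lists with pixel values between 0.0 and 1.0), the function names and the parameter ranges are all chosen for the example.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image, pixel values in [0.0, 1.0]


def adjust_brightness(image: Image, factor: float) -> Image:
    """Scale every pixel by `factor` and clip the result to [0.0, 1.0]."""
    return [[min(1.0, max(0.0, p * factor)) for p in row] for row in image]


def shift_right(image: Image, pixels: int) -> Image:
    """Shift the content right by `pixels` columns, padding with 0.0."""
    if pixels <= 0:
        return [row[:] for row in image]
    width = len(image[0])
    pixels = min(pixels, width)
    return [[0.0] * pixels + row[:width - pixels] for row in image]


def augment(image: Image, n_variants: int, rng: random.Random) -> List[Image]:
    """Create `n_variants` randomly perturbed copies of `image`.

    Mirroring is deliberately left out: a flipped traffic sign can
    mean something different from the original.
    """
    variants = []
    for _ in range(n_variants):
        factor = rng.uniform(0.7, 1.3)  # assumed brightness range
        shift = rng.randint(0, 2)       # assumed maximum shift in pixels
        variants.append(shift_right(adjust_brightness(image, factor), shift))
    return variants


# A tiny made-up 2x3 image; a fixed seed keeps the example repeatable.
sample = [[0.2, 0.4, 0.6], [0.8, 1.0, 0.0]]
artificial_samples = augment(sample, n_variants=3, rng=random.Random(0))
```

Each variant keeps the label of the original image, so a small labeled set can be enlarged without additional annotation effort.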


Research priorities

Multimodal recognition and indexing

Combining several input channels, such as voice recognition and facial recognition, and using multimodal indexing to search media content and sequences, for example in videos

Data-efficient learning

Training with limited data using data augmentation, artificial data generation, transfer learning, and semi-supervised and active learning

Representation learning

Unsupervised learning of semantically meaningful representations of raw data using Deep Learning methods

Embedded and real-time perception

Efficient real-time processing on devices with integrated sensors, high-performance embedded hardware and parallel processing units in mobile systems



Fraunhofer audio mining for the German television network ARD

As part of our long-term cooperation with the German television network ARD, automated speech recognition and other audio technologies, such as voice recognition, are being used for archival and editorial work. The audio mining tool allows journalists to search for spoken keywords across the entire ARD archive and retrieve relevant articles and contributions. Voice recognition makes it possible to search for and retrieve statements made by specific individuals. Generating subtitles and transcribing raw material provide further valuable editorial support.
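Searching for spoken keywords typically relies on transcripts in which each recognized word carries a timestamp. The sketch below shows one simple way to make such transcripts searchable, using an inverted index from words to positions in recordings. It is an illustrative assumption, not a description of the ARD system. The recording IDs, words and timestamps are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Word = Tuple[str, float]  # (recognized word, start time in seconds)
Hit = Tuple[str, float]   # (recording ID, start time in seconds)


def build_index(transcripts: Dict[str, List[Word]]) -> Dict[str, List[Hit]]:
    """Map each lowercased word to every position where it was spoken."""
    index = defaultdict(list)
    for recording_id, words in transcripts.items():
        for word, start in words:
            index[word.lower()].append((recording_id, start))
    return dict(index)


def search(index: Dict[str, List[Hit]], keyword: str) -> List[Hit]:
    """Return all (recording ID, start time) hits for `keyword`."""
    return index.get(keyword.lower(), [])


# Invented example transcripts with word-level timestamps.
transcripts = {
    "news_001": [("Der", 0.0), ("Landtag", 0.4), ("tagt", 1.1)],
    "talk_002": [("landtag", 12.5)],
}
index = build_index(transcripts)
hits = search(index, "Landtag")
```

Because each hit carries a start time, a search result can point directly to the moment in the recording where the keyword was spoken.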

Live subtitling in the parliament of the Free State of Saxony

The parliament (Landtag) of the Free State of Saxony uses our speech recognition system for live subtitling of plenary session broadcasts. We were tasked with training the system to recognize a range of specialist legal and political terms as well as the names of politicians. An additional module automatically inserts punctuation and ensures the text flows naturally and in a structured fashion. The speech recognizer can be accessed via the cloud. Applications requiring high levels of data protection can be installed locally using standard server architecture.

Speech dialog system

We have worked with Volkswagen AG to develop a prototype speech dialog system. The built-in dialog system acts as an interactive city tour guide, answering the driver's questions about selected points of interest along the route. The prototype is a good example of how our speech technologies (speech recognition, content analysis using knowledge graphs, and speech synthesis) interact in dialog systems based on domain-specific knowledge.

Image recognition patent

We have been granted a patent for our image recognition technology, which efficiently and reliably recognizes circular objects such as road traffic signs.

Recognizing road traffic signs in roadworks

As part of our work on the BMWi project AutoConstruct, we developed methods for recognizing road traffic signs in roadworks. Roadworks continue to be a real challenge for driver assistance systems. Deep Learning methods allow roadwork markings and lane guidance signs to be recognized in camera images in real time. This work is fundamental to the development of future assistance functions and to the ability of highly automated driving systems to operate in roadworks.

Condition monitoring in sewer networks

Together with research and application partners, we have been developing a system for the automated detection and analysis of damage within municipal sewer networks. This work is part of the BMBF project "Automatic condition analysis of sewer pipes" (AUZUKA). High-definition cameras and 3D sensors capture images of the sewer pipe surfaces, from which models are created. Neural networks were trained to recognize different types of damage such as tears, cracks or ingrowing roots. Human experts are then able to establish quickly and reliably exactly what type of repair is required.