First publicly available Japanese AI dialogue system can speak and listen simultaneously

July 15, 2025
Gaby Clark, scientific editor; Andrew Zinin, lead editor

Editors' notes: This article has been reviewed according to Science X's editorial process and policies. Editors have highlighted the following attributes while ensuring the content's credibility: fact-checked, preprint, trusted source, proofread.

The Higashinaka Lab is developing AI-human dialogue systems designed to work alongside human operators. As part of their research, a guide robot was deployed at Osaka's NIFREL Aquarium to answer visitors' questions about marine life. Human operators could step in to provide help with complex questions. Credit: Higashinaka Lab, Nagoya University. Taken at NIFREL Aquarium, Osaka

How do you develop an AI system that perfectly mimics the way humans speak? Researchers at Nagoya University in Japan have taken a significant step toward this goal: they have created J-Moshi, the first publicly available AI system designed specifically for Japanese conversational patterns.

J-Moshi captures the natural flow of Japanese conversation, which is punctuated by short verbal responses known as "aizuchi" that speakers use to show they are actively listening and engaged. Responses such as "Sou desu ne" (that's right) and "Naruhodo" (I see) occur far more often than comparable responses do in English.

Traditional AI has difficulty using aizuchi because it cannot speak and listen at the same time. This capability is especially important for natural-sounding Japanese AI dialog. Consequently, J-Moshi has become very popular with Japanese speakers who recognize and appreciate its natural conversation patterns.
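A full-duplex system does not alternate between a "listening" mode and a "speaking" mode; conceptually, it consumes one frame of input and emits one frame of output at every step. The toy sketch below illustrates that frame-synchronous loop with a hard-coded backchannel rule. It is an illustration of the concept only, not J-Moshi's actual code: J-Moshi generates its output stream with a neural model, and the frame representation and policy here are invented for the example.

```python
# Toy sketch of a full-duplex loop (not J-Moshi's implementation):
# every step consumes one input frame AND produces one output frame,
# so listening and speaking are never mutually exclusive.

SILENCE = "<sil>"

def backchannel_policy(heard_frames):
    """Invented toy rule: emit an aizuchi after every 4th voiced frame heard."""
    voiced = [f for f in heard_frames if f != SILENCE]
    if voiced and len(voiced) % 4 == 0:
        return "sou desu ne"
    return SILENCE

def full_duplex_loop(user_frames):
    heard, spoken = [], []
    for frame in user_frames:                      # listen to one frame...
        heard.append(frame)
        spoken.append(backchannel_policy(heard))   # ...and speak in the same step
    return spoken
```

Because output is produced on every step, an aizuchi can land while the user is still mid-utterance, which a strictly turn-based system cannot do.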

Prof. Higashinaka (right) and his team are collaborating on developing humanoid robots that combine speech, gestures, and movement to communicate naturally with people. Credit: Higashinaka Lab, Nagoya University

Building a Japanese Moshi model

The development team, led by researchers from the Higashinaka Laboratory at the Graduate School of Informatics, built J-Moshi by adapting the English-language Moshi model created by the non-profit laboratory Kyutai. The process took about four months and involved training the system using multiple Japanese speech datasets. The research is published on the arXiv preprint server.

The largest dataset was J-CHAT, the biggest publicly available Japanese dialogue dataset, created and released by the University of Tokyo; it contains approximately 67,000 hours of audio from podcasts and YouTube. Additionally, the team used smaller but higher-quality dialogue datasets, some collected within the lab and others dating back 20–30 years. To increase their training data, the researchers also converted written chat conversations into artificial speech using text-to-speech programs they developed for this purpose.
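Mixing one huge but noisy corpus with small, clean ones is a common training recipe, and one simple approach is quality-weighted sampling. The sketch below is a generic illustration of that idea, not the team's actual recipe: apart from J-CHAT's roughly 67,000 hours, the corpus names, sizes, and quality factors are invented for the example.

```python
# Hypothetical corpus mix for illustration. Only J-CHAT's ~67,000 hours
# comes from the article; the other entries and all quality factors are assumed.
corpora = {
    "j-chat":        {"hours": 67000, "quality": 1.0},
    "in-lab":        {"hours": 300,   "quality": 3.0},  # small, high quality
    "synthetic_tts": {"hours": 1000,  "quality": 1.5},  # TTS-converted chats
}

def sampling_weights(corpora):
    """Quality-weighted sampling: upweight small, clean sets so the
    huge web corpus does not completely drown them out."""
    raw = {name: c["hours"] * c["quality"] for name, c in corpora.items()}
    total = sum(raw.values())
    return {name: r / total for name, r in raw.items()}
```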

  • Ph.D. student Atsumoto Ohashi, the main developer of J-Moshi, demonstrates how the AI system mimics natural Japanese conversation patterns. He has been working on the optimization of task-oriented dialogue systems for his Ph.D. Credit: Merle Naidoo, Nagoya University
  • Ph.D. student Yuki Zenimoto engages with a question-guiding dialogue system that elicits user healthcare information through casual conversation. Credit: Merle Naidoo, Nagoya University

In January 2025, J-Moshi gained significant attention when demonstration videos went viral on social media. Beyond its technical novelty, it has potential practical applications in language learning, such as helping non-native speakers practice and understand natural Japanese conversation patterns.

The research team is also exploring commercial applications in call centers, health care settings, and customer service. They note that adapting the system to specialized fields or industries is challenging due to the limited availability of Japanese speech data compared to resources available for English.

The research team's leader, Professor Ryuichiro Higashinaka, brings a unique perspective to academic AI research, having spent 19 years as a corporate researcher at NTT Corporation before joining Nagoya University five years ago.

During his industry tenure, he worked on consumer dialog systems and voice agents, including a project to realize a question-answer function for Shabette Concier, a voice agent service by NTT DOCOMO. To further pursue research on human communication patterns, he set up his own lab at Nagoya University's Graduate School of Informatics in 2020.

His 20-member lab now tackles challenges that bridge theoretical research and practical applications, from understanding conversational timing in Japanese to deploying AI guides in public spaces like aquariums.

"Technology like J-Moshi can be applied to systems that work with human operators. For example, our guide robots at the NIFREL Aquarium in Osaka can handle routine interactions independently and easily connect visitors to human operators for complex questions or when specialized assistance is needed," Professor Higashinaka said. "Our work is part of a national Cabinet Office Moonshot Project that aims to improve service quality through advanced AI-human collaboration systems."

Ph.D. student Sanae Yamashita (left) works on techniques that summarize conversations to help human operators step in when AI dialogue systems need assistance. Researcher Ao Guo (right) focuses on making mobile guidance robots more user-friendly using speech, gestures, and movement. Credit: Merle Naidoo, Nagoya University

Opportunities and challenges for human-robot interactions

Prof. Higashinaka explained the unique challenges facing Japanese AI research: "Japan suffers from a scarcity of speech resources, limiting researchers' ability to train AI dialog systems. Privacy concerns also need to be considered."

This data shortage forced creative solutions, such as using computer programs to separate mixed voices in podcast recordings into individual speaker tracks needed for training.
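One common way to obtain per-speaker tracks from mixed recordings is to run speaker diarization (determining who spoke when) and then cut the audio along those segments. The toy function below sketches only the cutting step, assuming diarization output is already available; it ignores overlapping speech, which real separation tools must handle, and it is not the researchers' actual program.

```python
# Toy sketch: cut a mono recording into per-speaker tracks using
# diarization output. Assumes segments are non-overlapping; real
# podcast audio also requires true source separation for overlaps.

def split_by_speaker(samples, segments, rate=16000):
    """samples: mono audio as a sample sequence.
    segments: list of (start_sec, end_sec, speaker_id) diarization results."""
    tracks = {}
    for start, end, speaker in segments:
        chunk = samples[int(start * rate):int(end * rate)]
        tracks.setdefault(speaker, []).extend(chunk)
    return tracks
```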

Currently, dialogue systems struggle with complex social situations, especially those in which interpersonal relationships and physical surroundings must be taken into account. Visual obstacles such as masks or hats can also impair performance by covering important cues like facial expressions. Testing at Osaka's NIFREL Aquarium showed that the AI sometimes cannot handle user questions and needs human operators to intervene and take over the conversation.

While J-Moshi represents a significant achievement in capturing natural Japanese conversational patterns with overlapping speech and aizuchi interjections, these limitations mean it currently needs human backup for most practical applications. The researchers are working to strengthen these backup systems, including dialogue summarization methods and breakdown detection systems that alert operators to potential problems so they can respond quickly.
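A breakdown detector can be as simple as watching for user repair phrases and low system confidence, then escalating to a human operator. The heuristic below is a hypothetical illustration of that idea, not the lab's method: the phrase list, thresholds, and turn format are all assumptions made for the example.

```python
# Hypothetical breakdown-detection heuristic (not the lab's system):
# hand the conversation to a human operator when the user keeps issuing
# repair phrases or the system's own confidence stays low.

REPAIR_PHRASES = {"what?", "sorry?", "i don't understand"}  # assumed list

def needs_handover(turns, conf_threshold=0.4, max_repairs=2, max_low_conf=3):
    """turns: list of dicts like {"speaker": "user"/"system",
    "text": str, "confidence": float (system turns only)}."""
    repairs = sum(
        1 for t in turns
        if t["speaker"] == "user" and t["text"].lower() in REPAIR_PHRASES
    )
    low_conf = sum(
        1 for t in turns
        if t["speaker"] == "system" and t.get("confidence", 1.0) < conf_threshold
    )
    return repairs >= max_repairs or low_conf >= max_low_conf
```

In a deployment like the aquarium guide, a detector of this kind would run alongside a dialogue summarizer so the operator who takes over already has the context.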

The lab's broader research extends beyond J-Moshi and includes multiple methods for human-robot interaction. In collaboration with colleagues working on realistic humanoid robots, they are developing robot systems that coordinate speech, gestures, and movement for natural communication.

These robots, including those manufactured by Unitree Robotics, represent the latest advances in AI in physical form, where dialog systems must navigate not just conversational nuances but also physical presence and spatial awareness. The team regularly showcases their work during university open campus days, where the public can experience how AI dialog systems are evolving firsthand.

Their paper on J-Moshi has been accepted for publication in Interspeech, the largest international conference in the field of speech technology and research. Professor Higashinaka and his team are looking forward to presenting their J-Moshi research in Rotterdam, The Netherlands, in August 2025.

"In the near future, we will witness the emergence of systems capable of collaborating seamlessly with humans through natural speech and gestures. I aspire to create the foundational technologies that will be essential for such a transformative society," Professor Higashinaka said.

More information: Atsumoto Ohashi et al, Towards a Japanese Full-duplex Spoken Dialogue System, arXiv (2025). DOI: 10.48550/arxiv.2506.02979

Listen to audio of J-Moshi here: https://nu-dialogue.github.io/j-moshi/

The codebase used for training J-Moshi is available here: https://github.com/nu-dialogue/moshi-finetune

Journal information: arXiv

Provided by Nagoya University

Citation: First publicly available Japanese AI dialogue system can speak and listen simultaneously (2025, July 15), retrieved 15 July 2025 from https://techxplore.com/news/2025-07-japanese-ai-dialogue-simultaneously.html
