This post was originally just focused on a then-new interview with 2 RICOH THETA engineers, focused on 360° Spatial Audio. It wasn’t available in English, so we did a quick translation and posted it. Really cool info, with a direct connection to the people building it.
Since then RICOH has translated that interview and 3 more RICOH THETA developer interviews. I’ve posted links to all 4 here for easy access.
Vol.1, “360°Spatial Audio, Ask the developer” - Atsushi Matsuura and Takafumi Ohkuma
Vol.2, “360°Spatial Audio, 3D Microphone TA-1” - Takafumi Ohkuma
Vol.3, “What are function extensions? Ask the developer” - Ryoh Fukui and Masato Takada
Vol.4, “4K spherical videos, Ask the developer” - Kazuhiro Matsumoto and Makoto Shohara
This is an unofficial translation by the Unofficial Guide of an official RICOH interview with two of the main audio engineers working on the new RICOH THETA. (Original Japanese article here. There’s now an official English version, too.)
Vol 1 RICOH THETA Developer Interview - Asking About 360° Spatial Audio
Matsuura (left), Okuma (right)
“Spatial Audio takes 4K high resolution video and makes it an even more immersive experience.”
To start off, please define what 360° Spatial Audio is
Okuma: Spatial Audio is a term that has started to be used recently. Simply put, it’s the structure for recording in 3D and replaying that sound. “Stereophonic sound” is another way of putting it. In addition to the conventional left and right directions, it’s the collection of sounds including up and down and depth. Centered on the listener, you can literally enjoy listening to sounds covering all directions of 360°.
RICOH Company, Ltd., Business Products Division, Advanced Products Lab, Takafumi Okuma
The basis of all this is a stereophonic technique called Ambisonics, developed in Europe in the 1970s. However, Ambisonics required huge machinery, and the settings took a lot of time as well. Technically, at the time, it was not easy to use in combination with video. Now, though, it is possible to rotate images, including VR (virtual reality), in accordance with the position of the head when wearing an HMD (head mounted display). If you also rotate the audio to match, the experience becomes even more immersive.
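To make the idea of rotating the audio along with the head a little more concrete, here is a minimal sketch of first-order Ambisonics in Python. It is not RICOH’s implementation. It assumes an AmbiX-style channel convention (W unscaled, X front, Y left, Z up), and the function names are made up for illustration.

```python
import math

def encode_first_order(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one mono sample into first-order Ambisonics (W, X, Y, Z).

    Assumed convention: W carries the sample unscaled, X points to the
    front, Y to the left, Z up. Azimuth is counterclockwise seen from above.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample
    x = sample * math.cos(az) * math.cos(el)
    y = sample * math.sin(az) * math.cos(el)
    z = sample * math.sin(el)
    return (w, x, y, z)

def rotate_yaw(bformat, yaw_deg):
    """Rotate the whole sound field counterclockwise about the vertical axis."""
    w, x, y, z = bformat
    a = math.radians(yaw_deg)
    return (w,
            x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)

def compensate_head_yaw(bformat, head_yaw_deg):
    """Counter-rotate the field so sources stay put when the head turns."""
    return rotate_yaw(bformat, -head_yaw_deg)

def estimate_azimuth(bformat):
    """Recover the horizontal direction of a single encoded source."""
    _, x, y, _ = bformat
    return math.degrees(math.atan2(y, x))

# A source straight ahead; the listener then turns their head 90° to the left.
front = encode_first_order(1.0, azimuth_deg=0.0)
heard = compensate_head_yaw(front, head_yaw_deg=90.0)
print(round(estimate_azimuth(heard)))  # the source now sits to the listener's right
```

Because the whole sound field is stored as four channels, turning it is just a small rotation applied to X and Y. W and Z are untouched by a turn about the vertical axis.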
Why did you decide to install 360° spatial audio into the new THETA?
Okuma: Originally we were considering adopting 360° spatial audio even with the first model THETA in 2013, and again in the second generation m15, which added video, and again with the THETA S. However, at that time the video resolution was not very high across the super-wide angle of the full sphere, so the picture and the sound were not well balanced. As a result, the timing of implementation was delayed.
Matsuura: This time around, Ambisonics can be processed within the THETA main unit, since it now has a high-performance main processor of the kind also used in smartphones, which can run various kinds of processing inside the camera. The previous THETA S and SC just did not have enough power to handle Ambisonics. Ambisonics itself is old technology, as already mentioned, but with the spread of 360° VR video and the improvement of the processing power of smartphones, the conditions for implementing Ambisonics are now in place.
Okuma: And now THETA is able to take 4K movies. If the sound is poor relative to the high resolution video, the sense of realism will be impaired. So our development team decided to upgrade the sound.
The setup with the microphone for using Ambisonics which I mentioned has been released by other companies. However, even just recording the sound requires a specialized expensive recorder, and the settings for combining with the video after recording and further aligning the direction of the sound are quite hard for general users.
Matsuura: Not only special hardware but also special software is required, so it costs money for both. And even if you get ahold of all those extras and the environment is set up correctly, you still need special skills to pull it all together. Even if it’s possible for people who are doing video production professionally, I think the hurdles are very high for the general public.
Okuma: This time, you’ll be able to experience 360° spatial audio with just the THETA, no extra equipment required. With little extra effort, we have a mechanism to automatically match the front of the image with the front of the sound with a single touch, so you can collect and play spatial audio linked to 360° video. As an introduction machine for 360° spatial audio, I think that VR’s footprint can be expanded similar to how the early THETA model became an introduction path to VR imagery.
“We specifically considered the balance between the thin profile that is a feature and the frequency characteristics and directivity of the microphone.”
What were some of the difficulties in implementing 360° spatial audio?
Okuma: THETA’s new model has four built-in microphones. The most difficult thing is the placement of the microphones. Depending on the position of the microphones, the frequency range over which signal processing can maintain directionality changes. For example, the frequency of a person’s voice is in one range, and the frequency of an instrument is in another range. Of course, it is better to cover a wide frequency range, which leads to greater realism. The problem here is design. THETA is thinner than other companies’ products, so the challenge was balance: how to place the microphones neatly on the main body, wiring included, while maintaining their performance. Therefore, this time, we arranged the four microphones symmetrically in up/down and left/right positions, to preserve as much as possible the directivity across the frequency range and the quality of the recorded sound.
Ricoh Company, Ltd., Smart Vision Business Division, Product Development Center, Device Development Department, Atsushi Matsuura
Matsuura: We have looked in detail at how to record spatial audio in MP4 videos. When we run into issues, we iterate through solutions and always keep asking ourselves, “When will this be replayed?” Also, it’s obvious, but as we process video and sound together, it’s a real problem if they don’t line up perfectly. We did quite a bit of verification here. Since Ambisonics itself is established, stable technology, it is easier to test and verify with it. In our prototyping and testing, we incorporated provisionally designed numerical values so we could see how much time difference and attenuation occurs for sounds arriving simultaneously at each microphone. Unfortunately, we included these parameters in a real model. And then, when we were listening to sounds from a certain direction, they were coming out extra strong. I remember being asked “Is this an implementation mistake?” They wouldn’t listen when we told them it was just a problem with the parameters. (laughs)
Okuma: Oh yeah, I remember. (laughs) The prototype we made in the beginning was built by a 3D printer and the inside was empty. Since there was an echo from inside, I think that the testing parameter was different from what it would have been for a complete model. Sorry about that guys! (laughs)
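As a rough illustration of the kind of “time difference” parameter mentioned above, the sketch below estimates how much later a distant sound reaches one microphone than another. It is only a textbook plane-wave approximation, not the values or method RICOH used. The microphone spacing and the 48 kHz sample rate in the example are hypothetical, and attenuation is not modeled.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at about 20 °C

def arrival_delay_seconds(mic_spacing_m, angle_deg):
    """Extra travel time to the far microphone for a distant source.

    angle_deg is measured from the line joining the two microphones:
    0 means the source lies on that line, 90 means it is broadside
    (both microphones hear it at the same time).
    """
    return mic_spacing_m * math.cos(math.radians(angle_deg)) / SPEED_OF_SOUND_M_S

def delay_in_samples(mic_spacing_m, angle_deg, sample_rate_hz=48000):
    """The same delay expressed in audio samples (sample rate is an assumption)."""
    return arrival_delay_seconds(mic_spacing_m, angle_deg) * sample_rate_hz

# Hypothetical spacing of 2 cm between two microphones.
for angle in (0, 45, 90):
    print(angle, delay_in_samples(0.02, angle))
```

With spacings of a few centimeters the delays are only a handful of samples at most, so small changes to the enclosure can plausibly shift the values that work best.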
In what sorts of environments does 360° spatial audio record most effectively?
Okuma: I am currently doing field tests myself and verifying results. I’m checking places like the Shibuya scramble crossing, where sound reverberates from all directions, and a forest where the sounds of cicadas rain down from every direction. One of the spots is a park in the immediate vicinity of Haneda Airport, and that one really got me excited. It feels like the roar of the plane is right over your head with all the takeoffs and landings. The sounds were interesting, of course, but I was also able to capture an interesting image that I wanted to share with someone.
Also, I play saxophone as a hobby and I’ve been recording my own performances. The sound is clearly different from previous models. I think it is recorded in a form close to the sound that I actually hear with my own ears. I think live recordings of music will really demonstrate its strength.
This is also my hobby, but when I play tennis, if I put it close to the center of the net and record, I get a sense of realism from the images and sounds that I’ve never experienced before. If you’re not careful, you’ll hit the THETA with the ball, though. (laughs) I think switching between images taken in different locations would also be good.
The sounds of drums and flutes from summer festivals are also good. The atmosphere of a theme park or sightseeing spot could also be experienced in full. I feel that the quality of memories will change as well. Because of its high sensitivity, a variety of sounds are picked up that people are not initially aware of, so it is fun to listen later.
Matsuura: It is quite different from what Okuma has just been describing, but I think that the effect is easy to understand if it is in a closed space. Although from a story point of view it might be less fun, I think the effectiveness of spatial audio is easy to understand in places where a person is talking in a quiet room or where there are no big sounds nearby. Locations where there’s “sound with a clear source” are good. Like a drinking party at a friend’s house.
I tried recording during a walk in the mountains, but mountains are not suitable for spatial audio (laughs). The 4K video is very effective, but the sounds of nature come in from every direction, and there is no other sound besides that. When you’re surrounded by nature, the direction has nothing to do with it. Maybe it would be effective close to a waterfall or something like that, so I’ll plan on trying that next time.
“The most effective audio setup is when you combine aural (closed type) headphones with a head mounted display”
Can I publish and share images of 360° spatial audio on the internet?
Okuma: You can keep using the THETA site, as usual. We also want to make it available on YouTube and Facebook, which support 360° spatial audio content. As Matsuura says, there are differences in how to record spatial audio in video format, so this is an issue that we need to keep working at to realize that. If the SNS side can support this, it will be much more convenient for our users. (laughs)
What kind of environment is suitable for listening to 360° spatial audio? Can you feel the effect even with ordinary speakers?
Okuma: Spatial audio is mixed in 2 ch, optimized mainly for headphones. The best recommendation is aural (closed type) headphones. 360° spatial audio is characterized by its high resolution, so the type called monitor headphones, which output the recorded sound as it is, is recommended. Conversely, if the headphones color the sound too heavily, the sound will be pulled toward the characteristics of the headphones and will differ from the actual reality.
Matsuura: Okuma’s recommendation would raise the bar too high (laughs), so please experience the sound with regular earphones to begin with.
Okuma: Earphones deliver much less sound than headphones, and their frequency range is also narrower, so unless you have very sensitive ears, I personally do not recommend earphones. So, if you want to “feel more realism,” please try the aural headphones. (laughs)
When listening through speakers, if the distance between the listener and the 2 channel speakers is not constant, the balance of the spatial audio will be lost. When a person moves, the positional relationship with the speakers changes, so the realism is quickly lost. It is the same as how 5.1 ch surround has an ideal viewing position. When the rotation of the head matches the picture and sound, as with a Head Mounted Display (HMD), more realism can be achieved.
“Audio enhancement was also a request from users.”
Was it users who originally requested “Please enhance the audio”?
Okuma: In the previous model, the sound could become distorted depending on the situation, and solving this was one of our goals. Better sound quality was a major piece of feedback from users, and I wanted to respond to that by all means. In the new model, the performance of the individual microphone element has been greatly upgraded. Specifically, we switched from an analog microphone to a MEMS microphone, an electronic component fabricated at microscopic scale. Even though it is small, people’s voices are well recorded, and it has also been adopted for smartphones in recent years. In other words, the MEMS microphone is one of the optimal solutions for the current THETA, which incorporates four microphones in a thin design. The merit of the MEMS microphone is that the variation in quality is small. With ordinary microphones, when four are installed, characteristics will vary between microphones, and it becomes difficult to maintain the performance of spatial audio. The fact that the characteristics do not vary means that the spatial audio can be accurately recorded.
Also, even if a very loud sound comes in, it is recorded without distortion. Two recording modes are provided. One is a mode that holds down the volume gain when a loud sound comes in, so as to avoid distorting the sound as much as possible. The second is a mode in which the sound is not distorted as long as it is within a normal range. This should be especially effective with live streaming, which was difficult up till this point.
Matsuura: This is a comment completely from the user’s point of view. Once in a while, while using my THETA, there would be a loud noise, and I was concerned that the sound would break up. I wanted “a mode that allows video to be taken without sound.” When shooting with lots of friends, the conversation sounds more like an annoying loud noise than a cheerful one. (laughs)
Okuma: In order for people to hear easily, many previous cameras have been adopting a method that automatically puts the gain in a certain range (auto gain control). This is good when you want to take a specific direction, but the real feeling is lost as it is different from the actual sound. Because inflection is expressed in orchestral, musical instrument or vocal performances, it was not suitable for music.
Even in spatial audio, sound inflection caused by distance or directional difference is important. Although the adjustment was made for a better gain compared to the THETA m15, with the first video function, there still was a performance limitation of the microphone and the capturing codec (hardware of the audio capturing section) with THETA S.
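The effect of auto gain control on “inflection” can be shown with a toy example. The sketch below is a deliberately crude AGC that normalizes each block of audio to the same level, compared with a fixed gain. It is an illustration only and not how any THETA model processes audio. The function names, block length, and target level are all made up.

```python
import math

def rms(block):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in block) / len(block))

def auto_gain(blocks, target_rms=0.1):
    """Toy AGC: scale every block so its RMS level matches target_rms."""
    out = []
    for block in blocks:
        level = rms(block)
        gain = target_rms / level if level > 0 else 1.0
        out.append([s * gain for s in block])
    return out

def fixed_gain(blocks, gain=1.0):
    """Apply the same gain to every block, preserving relative loudness."""
    return [[s * gain for s in block] for block in blocks]

def tone(amplitude, n=480, freq_hz=440.0, sample_rate_hz=48000):
    """Generate a short sine tone as a stand-in for recorded sound."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(n)]

# A quiet passage followed by one ten times louder.
quiet, loud = tone(0.05), tone(0.5)

agc_out = auto_gain([quiet, loud])
fixed_out = fixed_gain([quiet, loud])

agc_ratio = rms(agc_out[1]) / rms(agc_out[0])        # close to 1: dynamics flattened
fixed_ratio = rms(fixed_out[1]) / rms(fixed_out[0])  # close to 10: dynamics kept
print(agc_ratio, fixed_ratio)
```

With the toy AGC, the loud and quiet passages come out at the same level, which is exactly the loss of inflection described above. With a fixed gain the tenfold difference survives.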
Matsuura: That’s right. I think by not including auto gain control, the sound became much more real. At this point, there are not many cameras like this new model in the world, so there is nothing we can compare to, and it was trial and error to figure out “where to aim for.”
In order to popularize the product, an ordinary player has to be able to reproduce sound and image, much better than with the THETA S and SC. On top of that, in the case of Ricoh’s own player (application), spatial audio needs to be nicely reproduced. The video format of this new THETA and spatial audio were created by taking these points into consideration. We would love for people to try video and sound full of reality in various scenes and usages.
Okuma: That’s right. First of all, please purchase one (laughs). We want people to have a whole new experience!
Takafumi Okuma
RICOH Company, Ltd., Business Products Division
Advanced Products Lab
Joined RICOH in 2006. In college engaged in research on environmental electromagnetic engineering. At RICOH, responsible for development of electrical hardware for the compact cameras GR and THETA, and software operation specs relating to hardware. Hobbies include playing saxophone and tennis.
Atsushi Matsuura
Ricoh Company, Ltd., Smart Vision Business Division
Product Development Center, Device Development Department
Joined RICOH in 2010. Conducted research on color engineering in college. Responsible for the GUI for the GR camera as well as THETA sound development, and areas outside of recording. Outside of work, likes hiking and movies to constantly refresh perspective.