
Asynchronous Audio Technology with Carl Robinson

By: Thoughts of a Random Citizen Podcast

Today, my guest is Carl Robinson, CEO and co-founder of Rumble Studio, a startup in Paris on the frontier of asynchronous interview audio technology. Carl, welcome to the podcast. As I dive deeper into your technology, I see it as groundbreaking, especially when incorporating AI. Can you walk us through what your company Rumble Studio does?

Rumble Studio is an online tool, a SaaS, that helps creators, agencies, and companies create audio content, like podcasts, much more quickly, easily, and affordably. The unique part of our solution is that we don't run live interviews like we're doing now but asynchronous interviews, which we brand as take-turns interviews. Essentially, you set the questions upfront and send an invitation to your guest or guests (you can interview as many people as you want at the same time), and then the guests are interviewed by Rumble Studio automatically in their own time.

You say Rumble Studio interviews them; can you elaborate on that?

We're a young startup, building this in phases. Right now, we're at phase one, which means that you can write and record the questions into Rumble Studio and then send them. The experience right now for the guests is a questionnaire-type form, but with the addition of audio capture. A guest can click the public link, or the link that you shared, which goes straight into Rumble Studio. They read the question, or they can optionally listen to the host actually speak it out, so there's more feeling and detail in the audio, and then they can record the answer. When they're happy with that, they can move to the next question; if they're not, they can delete, redo, and build the answer up in multiple parts. They can go make a cup of tea, think about the answer, and then come back and record it again. There are a number of advantages, but they go through it question by question.

So the core of it is recording audio from the guest, but there's also the ability to capture text, images, and videos, and we're adding more and more of these node types. Right now, it's a static questionnaire-type form; there are no follow-ups. As a host, you capture that audio, and when you go back into the platform, you download it or load it up into our export mix feature. You see your questions, you see the guest's answers, and you have the option to then record a follow-up comment or some kind of narrative. That's how you can simulate a real conversation. In Rumble Studio's current version, there's no ability to ask follow-up questions on the fly as the guest is recording, though that is the vision for Rumble.


That was one of the things that I thought of when doing it. It's great because you can give these thoughtful responses, so I really liked that aspect, but will you be able to monitor when a guest is recording their response and then just shoot over a follow-up?

This is the AI bet that we're coming into. So phase two, and this is actually what our data scientists are working on, is a set of technologies that listen to what the guests say. The guests hear the question and record their first answer. That answer is recorded as audio, then transcribed into text, and then run through a series of modules that we're developing to extract different features from that speech, in both text form and audio form.

In text, we can do things like pull out keywords using named entity recognition and analyze those; we can identify the topics in general that they're talking about using topic modeling. We can also analyze the audio and do emotion detection, sentiment analysis, figuring out how they're feeling. We can measure the speed at which they're speaking to coach them to speak more quickly or more slowly. We build up all these characteristics on how they said it and what they said, and then we can build the decision-making part of that, which then decides how to react.
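As a rough illustration of the kind of feature extraction described here, the sketch below takes a transcript and the recording length and computes a speaking rate plus a list of candidate keywords. It is not Rumble Studio's code: the names `AnswerFeatures` and `extract_features` are hypothetical, and the capitalized-word pattern is only a naive stand-in for real named entity recognition.

```python
import re
from dataclasses import dataclass


@dataclass
class AnswerFeatures:
    """Hypothetical container for features pulled from one recorded answer."""
    word_count: int
    words_per_minute: float
    keywords: list


def extract_features(transcript: str, duration_seconds: float) -> AnswerFeatures:
    """Compute simple 'what they said' and 'how they said it' features."""
    words = re.findall(r"[A-Za-z']+", transcript)

    # Speaking rate: words spoken, scaled up to a full minute.
    if duration_seconds > 0:
        wpm = len(words) * 60 / duration_seconds
    else:
        wpm = 0.0

    # Naive stand-in for named entity recognition (an assumption, not a real
    # NER model): runs of two or more capitalized words, e.g. "Will Smith".
    keywords = re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", transcript)

    return AnswerFeatures(len(words), wpm, keywords)


features = extract_features("I think Will Smith was great at the show.", 3.0)
print(features)
```

A real system would swap the regular expression for a trained model and add audio-side features such as emotion, but the shape stays the same: one answer in, one bundle of characteristics out for the decision-making step to use.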

So the follow-up isn't done by the host; it's done by the AI?

Absolutely. That's the plan for Rumble, to be able to generate a follow-up question in real time and put it to the guests. So to give you some examples, say a guest gives a particularly short answer, which is not great for a podcast; you can just say, "Could you tell me a bit more, please?" It's super easy and secure. You probably don't even need AI to do that. Also, they could mention a keyword that's super hot in the news; maybe they mentioned Will Smith, so you'll say, "What do you think about the Will Smith thing?"

So these open questions lead the guests to talk a bit more, and in that way, you can capture more for the same amount of effort, bearing in mind you've just written one question into Rumble and just sent them a link. At this point, Rumble is doing it all for you. It's asking these follow-up questions, capturing more audio, and simultaneously making the experience more dynamic for the guests, so they don't think it's just a questionnaire, because it asks them stuff on the spot that they hadn't prepared. It makes it more spontaneous, which improves the audio for the listener.
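The two follow-up examples Carl gives can be sketched as simple rules, with no AI involved. This is a minimal sketch under stated assumptions: the function name `choose_follow_up`, the 30-word threshold, and the choice to check for a short answer before checking for hot topics are all illustrative, not details from Rumble Studio.

```python
from typing import Iterable, Optional


def choose_follow_up(
    transcript: str,
    hot_topics: Iterable[str],
    min_words: int = 30,  # assumed threshold for a "particularly short" answer
) -> Optional[str]:
    """Return a follow-up question for the guest, or None if none is needed."""
    # Rule 1: a very short answer gets an open prompt to say more.
    if len(transcript.split()) < min_words:
        return "Could you tell me a bit more, please?"

    # Rule 2: if the guest mentioned a topic that is hot in the news, ask about it.
    lowered = transcript.lower()
    for topic in hot_topics:
        if topic.lower() in lowered:
            return f"What do you think about the {topic} thing?"

    # Otherwise, move on to the next prepared question.
    return None


print(choose_follow_up("Yes, I agree.", ["Will Smith"]))
```

In the fuller vision described in the interview, the topic list and the decision itself would come from the speech-analysis modules rather than a hand-written list.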

It's really unique, considering the way podcasts are done now. I've been deep-diving into APIs and creating ways to automate workflows, and this technology seems to pair perfectly with that. After you send over the question list, if you automate it properly, your podcast is done and published, and you don't even have to do anything else. That's just incredible. Still, what kind of organizations and individuals are you seeing this technology helping the most today?

We're running experiments with the three main segments, which are: individual creators, influencers; marketing agencies who want to create audio content for their customers as quickly and cheaply as possible, as well as offer new innovative forms of content to their customers; and brands who want to create branded podcasts for themselves, or audio segments that they can convert into videos for social media or put on their website. Generally, a lot of brands are thinking about podcasts these days because podcasts are super hot.

So we're going to come back to audio and dive a lot more into everything that you're doing in this space, but I want to touch briefly on your time spent in China, where you lived for seven years. Can you tell us a bit about that experience and what initially made you leave the UK, where you're from?

I graduated in science, and then I did a few years in management consultancy in the city of London, which was great, but not really the job for me; that was the push factor. The pull factor towards China was a number of things; my friend had moved there and suggested doing a startup; he said it was super cheap to live over there. This was back in 2009, so China was a little bit different back then. Then a number of things happened in my personal life which made me want to have a career change. I realized working in the big city wasn't for me. So that was the reason why I moved over there, and I don't regret it for a second because it was a fantastic experience.

What is the entrepreneurial environment like? I guess it's probably a bit different now, and you can talk about that if you know anything about it, but when you went over there, what was the entrepreneurial environment like for startups?

There was a lot of energy; there was a lot more going on than I thought. I was more exposed to the small Western community of startup entrepreneurs than to the enormous burgeoning community of Chinese startup entrepreneurs. I think they mix a lot more now, but they were a little more separate when I was there. There's no comparison in terms of size. I think starting a startup in China is a very difficult thing, and anyone who does it successfully, hats off to them, because China is very protective of the way they do business.

I know there are a lot of horror stories of Westerners that do set up a business over there, and as soon as they achieve a level of success, the government just swoops in and steals it off of them and moves it over to a Chinese partner. So, you have to set up your business with a Chinese partner. They won't let you just create a business on your own and then just extract value from their country like that. The support mechanisms, compared to what I've found to be in France, London, or the US, were just not there.

When I was there, they didn't follow through. We actually did a startup competition; we got second place in this hackathon weekend, which is cool. The district of Xi'an said they would support us, but then the meetings that followed just weren't organized, and it fell apart. It was clear that they weren't particularly committed to the process, and there just wasn't the support framework that I've discovered in Europe, to be honest. So it was very difficult.

So what kept you there for those seven years then?

I did a number of different things. When I first got there, I started this translation marketplace with my friends. We were complete noobs; we didn't know what we were doing around startups. There wasn't a huge amount of support either. So we were just winging it and learning on the fly, but it takes time. You've got to invest in making the mistakes, and so this dragged on, partly because we hired a very small agency to start building the code, which I think is error number one. You really need a CTO in the company who can actually build it.

It was taking so long that we set up a second thing, which was an iPhone app that was my friend's idea. It was a simple iPhone app for healthy eating that would help people track their fruit and veg consumption. There were three of us; we each put in a thousand dollars and six months to build it. We got this cool freelancer to build the code, and a really amazing designer from the US who did an incredible job with the graphics. It all just came together because it was a much smaller, more manageable project. We got a really high-quality product out. It ended up getting featured by Apple on the iTunes store, which back then was probably a lot easier than it is today. It got many downloads, and we were really excited. Then we got contacted by a company who offered to buy it, and we sold it for a considerable sum.

It gave us loads of confidence about entrepreneurship. It showed it was possible, and it gave us some money, which goes a lot further in China, to be able to continue doing the first startup, which ended up eating a lot of that money up and which we ended up abandoning after four years. So the first idea lasted four years, and the second startup idea lasted about a year, and it was much more profitable. Then I joined an American startup called Gather Health, which was an incredible experience because I got to work as a product manager there and go through a full startup journey. They were funded, and I was working in a team of 30, split between China and India. That was my career in China.

So, if you could give someone a piece of advice if they wanted to go to China, would you recommend it today? What would that advice be?

I don't know if I'd recommend it to them, not if they're hoping for the same experience I had, anyway. I would absolutely recommend visiting, and I would say, if you do, you need to go with someone who speaks Chinese and is a proper guide who knows the city and can take you around. Still, I don't know about living there now because the prices have gone way up, the government is way harsher on a lot of things, and there isn't that kind of feeling of an economy and a society changing as quickly. I mean, they're changing, of course, and developing quickly, but not in the same way as in 2009.

I got there a little bit late. The bigger changes happened in the decades before that. So I think if you really want that kind of feeling of being in a country that's going through massive changes and where there are a lot of opportunities, go to one of the other countries that are developing more quickly. I hear about Vietnam, for example; maybe you'll have more of a similar experience there today to the one I had in 2009 in China.

What was the city that you were based in?


I know you've been living now in France since 2016. Why did you change to France specifically?

For a number of reasons. In Beijing, I met my now wife, Veronique, who's French. I have French citizenship because of my family history; I've got dual nationality, and Brexit was a bit of a problem at the time as well, but coming to France was super easy. I wanted to study, so I went back to school, and universities in France are excellent and much cheaper than in the UK. I ended up doing a two-year data science master's in France for a fraction of what it would cost in the UK, and I wouldn't even want to think what it costs in the US.

I'm curious; you say you have dual nationality. Did you speak a little French? Did you pick up Mandarin? How has the language barrier been, moving to such proud non-English-speaking countries as China and France?

They've got that reputation, but to be honest, especially among the people I meet in circles like the startup world and the tech world, everyone speaks really good English. It is hard for a Brit or an English speaker to improve their French. Maybe it sounds like an excuse, maybe it is, but it's hard work to improve your French when everybody can speak English if you want. So you just speak English, because when you're in a conversation, you start in French, and then they immediately realize that it would be easier in English.

Right. You cannot learn another language if you're English because one, they want to practice, and two, they don't want to waste their time.

Yeah, you're wasting their time in a way, just using them for language practice. So it's hard work. I know it's important to get to a certain level because it shows you're making an effort, and for certain things, like administration and getting around, you do need to know a level of French, but you don't need to be a hundred percent fluent.

China was very difficult in the first two years. I'd say much harder because they really don't speak as much English. I was very reliant on my friends in the first couple of years, then I learned enough to be able to do a lot of stuff on my own, like go to the bank and shop and all that, but still nowhere near enough to be able to work.

French, you can understand a bit, but Mandarin is completely different.

Definitely. To be honest, I know people who speak fluent Chinese, but it's not fluent enough for them to be competitive in the workplace. So even though they are amazing at speaking Chinese and have invested years of their lives, they're still unable to develop their careers in the way they want. They end up moving back to the States or wherever, just because there's that barrier.

Anyways, back to audio. You've been in the voice technology space for some time now, not only founding your current company, Rumble Studio, but also running your popular Voice Tech Podcast. You're no doubt considered a leader and an innovator in the voice tech space. So, as more and more tools and software come out, Rumble Studio being one of them, and cut down the time it traditionally takes to produce and edit audio content, how do you see the growth of this industry over the next five years?

When I joined the voice tech community a few years ago, there was a lot more buzz. It was newer, and there was a lot of optimism that the voice interfaces would be the new mobile apps because you can install them as skills or enable these skills, and then they would solve a variety of your day-to-day problems, but that hasn't really transpired. Certain apps are handy, but generally, there are only two or three killer apps on a smart speaker: on-demand music; smart home, being able to turn off light switches; or access features on your phone, as you can do with Siri. Transactional tasks. Still, we're a long way off from being able to speak naturally with a device and enjoy some of the high-level benefits that were promised back then.

Today's trends are moving more towards the enterprise, which was something that didn't happen initially. It was much more consumer-focused, with smart speakers coming out and the enterprise taking a back seat. Then things have moved more towards the enterprise, especially call centers and customer service. From speaking with other experts in the field, I understand that things are moving more towards the enterprise and will continue to do so.


That's news to me. Speaking of the audio industry, which you were just touching on, and considering you went back to France to obtain a master's degree in data science, where you learned how to code specifically for AI: how far off do you think we are from having voice assistants for consumers that you can interact with and teach via your own voice? I want to emphasize the teaching aspect. With Alexa, for example, there are some easy, small customizations she can't do, or she just shuts up. How far off are we from being able to teach a voice application via voice commands?

I completely know the frustration because I've got the Google stuff set up at home, and I ask the same things every day, but it still has the same problem. It's incredible. I think we're still a way off, to be honest, because there are many variables that they still need to fix. There are the different audio conditions and the different accents. I live in a bilingual household, and it's awful with French. I think also the investment in the consumer aspects has declined. I don't think the companies are investing as much now that the hype has died down, and there doesn't seem to be as much money in it. So the research will continue for sure. No doubt we will get there, but I think we're a few years off from it working flawlessly, and more than that from having a genuine, interesting back-and-forth conversation. I think that there's definitely research to be done.

Amazon Alexa, which was kind of the forefront leader of this space, came out almost 10 years ago, and it feels like after they released that, they left it behind, no updates. What would it require? Is it some kind of translation from your voice into coding? Because that's gotta be pretty complex to really come down to being able to teach and train a voice bot.

It's a huge stack of technologies, from being able to translate the audio of your voice into text and then to being able to interpret the text. That's where a lot of the work goes, the natural language understanding, figuring out what you mean. Not the words you said, but even with perfect words, you could mean many different things. There's a lot of nuance there, a lot of context. It needs to know who you are personally, where you are, what you were doing at the time, how you're feeling, and what it is that you were trying to achieve two minutes ago, which adds context to what you're trying to do.

There's a huge amount of stuff that we just take for granted as humans, and we just wrap all that up and understand what someone wants. Machines really struggle with that. I get disappointed from just turning on a light switch. What I want is obvious, it's not ambiguous, and still, there are problems with that. So I think a lot of the hype has died down because people feel a little bit disappointed that you can't even do simple things like that reliably. It's not a reliable light switch, and it's embarrassing if you use it in front of people.

I totally get it. Many people probably aren't aware of this, but as podcasters, we translate things for Google search to put in a post, and you can see from just a robotic translation how terrible it is. Not only does it miss words, but it puts abbreviations in the wrong places. It doesn't understand at all what you're saying. So I'm curious about what you were referring to, how organizations are now focusing on the enterprise. What is the main focus of that transition to the enterprise?

I'm not entirely sure; it's been a while since I've covered this stuff. I haven't spoken to as many people working in the enterprise on my podcasts because I was more enthused about the consumer-type applications. I understand that you have to think about these core technologies as more than just smart speakers. They're not just consumer devices that you transact with in your home, but the natural language understanding, for example, the transcription or the emotion detection, can be used in the backend for big enterprise installations, like call centers.

To give you an example, there's a company that I worked at called Batvoice AI, which does a dashboard that helps customer service agents do their job better by measuring the emotion, the intent, and various other characteristics of both the agent and the customer that they're speaking to. It can give the agent a live dashboard that says, "This customer's getting more irate," or "This customer is more likely to buy," so they can adapt.

The same thing can be used in the IVR, the interactive voice response. That is moving from pressing keys to navigate these huge menus, to it asking you what you want, you telling it, and then it, hopefully, accurately interpreting what you want and directing you through to the right service. Also, that emotion detection is really useful to be able to hear whether the customer is annoyed. What you really want is for people to be circumventing your technology.
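To make the IVR idea concrete, here is a toy sketch of routing a caller by what they say instead of by key presses, with a check for an annoyed caller. Everything in it is an assumption for illustration: the route names, the keyword lists, and the decision to send an annoyed caller straight to a human agent. A production system would use trained intent and emotion models on the audio rather than keyword matching on text.

```python
# Hypothetical services and the keywords that suggest each one.
ROUTES = {
    "billing": ("bill", "invoice", "charge", "payment"),
    "technical_support": ("broken", "error", "not working", "outage"),
    "sales": ("buy", "upgrade", "pricing"),
}

# Crude stand-in for emotion detection: phrases that hint the caller is annoyed.
ANNOYED_MARKERS = ("ridiculous", "third time", "fed up", "unacceptable")


def route_call(utterance: str) -> str:
    """Pick a destination for a caller based on a transcribed utterance."""
    text = utterance.lower()

    # An annoyed caller is sent to a person (an assumed policy).
    if any(marker in text for marker in ANNOYED_MARKERS):
        return "human_agent"

    # Otherwise, the first service with a matching keyword wins.
    for service, keywords in ROUTES.items():
        if any(keyword in text for keyword in keywords):
            return service

    # Nothing recognized: fall back to the traditional menu.
    return "main_menu"


print(route_call("I have a question about my invoice"))
```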