Sarah Bird’s role at technology group Microsoft is to ensure the artificial intelligence ‘Copilot’ products it releases — and its collaborative work with OpenAI — can be used safely. That means ensuring they cannot cause harm, treat people unfairly, or be used to spread incorrect or fake content.
Her approach is to draw on customer feedback from dozens of pilot programmes to understand the problems that might emerge and make the experience of using AI more engaging. Recent improvements include a real-time system for detecting instances where an AI model is ‘hallucinating’, or generating fictional outputs.
Here, Bird tells the FT’s technology reporter Cristina Criddle why she believes generative AI has the power to lift people up — but artificial general intelligence still struggles with basic concepts, such as the physical world.
Cristina Criddle: How do you view generative AI? Is it materially different to other types of AI that we’ve encountered? Should we be more cognisant of the risk it poses?
Sarah Bird: Yes, I think generative AI is materially different and more exciting than other AI technology, in my opinion. The reason is that it has this amazing ability to meet people where they are. It speaks human language. It understands your jargon. It understands how you are expressing things. That gives it the potential to be the bridge to all other technologies or other complex systems.
We can take someone who, for example, has never programmed before and actually allow them to control a computer system as if they were a programmer. Or you can take someone who, for example, is in a vulnerable situation and needs to navigate government bureaucracy, but doesn’t understand all the legal jargon — they can express their questions in their own language and they can get answers back in a way that they understand.
I think the potential for lifting people up and empowering people is just enormous with this technology. It actually speaks in a way that is human and understands in a way that feels very human — [that] really ignites people’s imagination around the technology.
We’ve had science fiction forever that shows humanoid AIs wreaking havoc and causing different issues. It’s not a realistic way to view the technology, but many people do. So, compared to all of the other AI technologies before, we see so much more fear around this technology for those reasons.
CC: It seems to be transformative for some tasks, especially in our jobs. How do you view the impact it will have on the way we work?
SB: I think that this technology is absolutely going to change the way people work. We’ve seen that with every technology. One of the perfect examples is calculators. Now, it’s still important in education for me to understand how to do that type of math, but day to day I’m not going to do it by hand. I’m going to use a calculator because it saves me time and allows me to focus my energy on what’s the most important.
We are absolutely seeing this in practice, [with generative AI], as well. One of the applications we released first was GitHub Copilot. This is an application that completes code. In the same way that it helps autocomplete your sentences when you’re typing an email, this is autocompleting your code. Developers say that they’re going 40 per cent faster using this and — something that’s very, very important to me — they are 75 per cent more satisfied with their work.
We very much see the technology removing the drudgery, removing the tasks that you didn’t like doing, anyway — allowing everybody to focus on the part where they’re adding their unique differentiation, adding their special element to it, rather than the part that was just repeated and is something that AI can learn.
CC: You’re on the product side. How do you balance getting a product out and making sure that people have access to it versus doing proper testing, and making sure it’s entirely safe and mitigating the risks?
SB: I love this question. The trade-off between when to release the technology and get it in people’s hands versus when to keep doing more work [on it] is one of the most important decisions we make. I shared earlier that I think this technology has the potential to make everyone’s lives better. It is going to be hugely impactful in so many people’s lives.
For us, and for me, that means it’s important to get the technology in people’s hands as soon as possible. We could give millions of talks about this technology and why it’s important. But, unless people touch it, they actually don’t have an opinion about how it should fit in their lives, or how it should be regulated, or any of these things.
That’s why the ChatGPT moment was so powerful, because it was the first moment that the average person could easily touch the technology and really understand [it]. Then, suddenly, there was enormous excitement, there was concern, there were many different conversations started. But they didn’t really start until people could touch the technology.
We feel that it’s important to bring people into the conversation because the technology is for them and we want to learn truly from how they’re using it and what’s important to them — not just our own ideas in the lab. But, of course, we don’t want to put any technology in people’s hands that’s really not ready or is going to cause harm.
We do as much work as we can upfront to identify those risks, build tests to ensure that we’re actually addressing them, and build mitigations as well. Then, we roll it out slowly. We test internally. We go to a smaller group. At each of these phases we learn and make sure that it’s working as expected. Then, if we see that it is, we can go to a wider audience.
We try to move quickly — but with the appropriate data, with the appropriate information, and making sure that we’re learning in each step and we’re scaling as our confidence grows.
CC: OpenAI is a strategic partner of yours. It’s one of the key movers in the space. Would you say that your approaches to responsible AI are aligned?
SB: Yes, absolutely. One of the reasons early on that we picked OpenAI to partner with is because our core values around responsible AI and AI safety are very aligned.
Now, the nice thing about any partnership is we bring different things to the table. For example, OpenAI’s big strength is the core model development. They’ve put a lot of energy into advancing state-of-the-art safety alignment in the model itself, whereas we are building a lot of complete AI applications.
We’ve focused on the layers you need to implement to get to an application: adding things like an external safety system for when the model makes a mistake, or monitoring and abuse detection, so that your security team can investigate issues.
We each explore in these different directions and then we get to share what we’ve learned. We get the best of both of our approaches, as a result. It’s a really fruitful and collaborative partnership.
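To make the application-layer safeguards Bird describes more concrete, here is a minimal sketch of one way such a layer could work: an external classifier checks both the prompt and the model’s response, and every decision is logged so a security team can investigate abuse later. The function names, the toy classifier and the threshold are illustrative assumptions, not Microsoft’s actual API.

```python
# Minimal sketch of an application-layer safety wrapper around a model call.
# All names (classify_harm, log_event, safe_generate) are illustrative, not a real API.

from datetime import datetime, timezone

BLOCK_THRESHOLD = 0.8  # tolerance for the external harm classifier (assumed value)


def classify_harm(text: str) -> float:
    """Stand-in for an external safety classifier; returns a 0-1 harm score."""
    banned_terms = {"how to build a weapon"}
    return 1.0 if any(term in text.lower() for term in banned_terms) else 0.0


def log_event(user_id: str, prompt: str, verdict: str) -> None:
    """Append to an audit trail so a security team can investigate abuse later."""
    timestamp = datetime.now(timezone.utc).isoformat()
    print(f"{timestamp} user={user_id} verdict={verdict} prompt={prompt!r}")


def safe_generate(model, user_id: str, prompt: str) -> str:
    """Check the prompt, call the model, then check the response before returning it."""
    if classify_harm(prompt) >= BLOCK_THRESHOLD:
        log_event(user_id, prompt, "blocked_input")
        return "Sorry, I can't help with that."
    response = model(prompt)  # the underlying model call
    if classify_harm(response) >= BLOCK_THRESHOLD:  # catch mistakes the model itself makes
        log_event(user_id, prompt, "blocked_output")
        return "Sorry, I can't share that response."
    log_event(user_id, prompt, "allowed")
    return response
```

The point of this design is that the check sits outside the model, so it can still catch a mistake even when the model itself misbehaves.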
CC: Do you think we’re close to artificial general intelligence?
SB: This is my personal answer, but I think AGI is a non-goal. We have a lot of amazing humans on this planet. And so, the reason I get out of bed every day is not to replicate human intelligence. It’s to build systems that augment human intelligence.
It’s very intentional that Microsoft has named our flagship AI systems ‘co-pilots’, because they’re about AI working together with a human to achieve something more. So much of our focus is about ensuring AI can do things well that humans don’t do well. I spend a lot more time thinking about that than the ultimate AGI goal.
CC: When you say AGI is a non-goal, do you still think it’s likely to happen?
SB: It’s really hard to predict when a breakthrough is going to come. When we got GPT-4, it was a huge jump over GPT-3 — so much more than anybody expected. That was exciting and amazing, even for people like myself who have worked in generative AI for a long time.
Will the next generation of models be as big of a jump? We don’t know. We’re going to push the techniques as far as we can and see what’s possible. I just take every day as it comes.
But my personal opinion is I think there are still fundamental things that have to be figured out before we could cross a milestone like AGI. I think we’ll really keep pushing in the directions we’ve gone, but I think we’ll see that run out and we’ll have to invent some other techniques as well.
CC: What do we need to figure out?
SB: It still feels like there are core pieces missing in the technology. If you touch it, it’s magical — it seems to understand so much. Then, at the same time, there are places where it feels like it doesn’t understand basic concepts. It doesn’t get it. An easy example is that it doesn’t really understand physics or the physical world.
For each of these core pieces that are missing, we have to go figure out how to solve that problem. I think some of those will need new techniques, not just the same thing we’re doing today.
CC: How do you think about responsibility and safety with these new systems that are meant to be our co-pilots, our agents, our assistants? Do you have to think about different kinds of risks?
SB: Everybody is really excited about the potential of agentic systems. Certainly, as AI becomes more powerful, we have the challenge that we need to figure out how to make sure it’s doing the right thing. One of the main techniques we use today — that you see in all of the co-pilots — is human oversight. You’re looking at whether or not you want to accept that email suggestion.
If the AI starts doing more complex tasks where you actually don’t know the right answer, then it’s much harder for you to catch an error.
That level of automation, where you’re not actually watching and [the AI is] just taking actions, completely raises the bar in terms of the number of errors you can tolerate. You have to have extremely low rates of those.
You could be taking an action that has real-world impact. So we need to look at a much broader risk space in terms of what’s possible.
On the agents front, we’re going to take it step by step and see where it is really ready and where we can get the appropriate risk-reward trade-off. But it’s going to be a journey to be able to realise the complete vision, where it can do many, many different things for you and you trust it completely.
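As a rough illustration of the human-oversight pattern Bird mentions, the sketch below gates high-impact agent actions behind explicit approval while letting low-impact ones run automatically. The action types and the approval prompt are hypothetical, not a description of any Microsoft product.

```python
# Minimal sketch of human oversight for an agentic system: low-impact actions
# run automatically, anything with real-world impact waits for explicit approval.
# The categories and helper names here are assumptions for illustration only.

from dataclasses import dataclass

HIGH_IMPACT = {"send_email", "make_payment", "delete_file"}  # assumed categories


@dataclass
class Action:
    kind: str          # e.g. "draft_reply" or "send_email"
    description: str   # human-readable summary shown to the reviewer


def approve(action: Action) -> bool:
    """Ask the human to confirm; in a real UI this would be a review dialog."""
    answer = input(f"Allow the agent to {action.description}? [y/N] ")
    return answer.strip().lower() == "y"


def run_agent_step(action: Action, execute) -> None:
    """Execute the action only if it is low impact or the human approves it."""
    if action.kind in HIGH_IMPACT and not approve(action):
        print("Action skipped: human reviewer declined.")
        return
    execute(action)
```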
CC: It has to build quite a good profile of you as an individual to be able to take actions on your behalf. So there is a personalisation point you have to think about as well, and how comfortable consumers and businesses are with that.
SB: That is actually one of the things that I love about the potential of AI. One of the things we’ve seen as a challenge in many computing systems is the fact they were really designed for one person, one persona.
If the built-in workflow works in the way you think, great: you can get enormous benefit from that. But, if you think a little differently, or you come from a different background, then you don’t get the same benefit as others from the technology.
This personalisation where it’s now about you and what you need — as opposed to what the system designer thought you needed — I think is huge. I often think of the personalisation as a great benefit in responsible AI and how we make technology more inclusive.
That said, we have to make sure that we’re getting the privacy and the trust stories right, to make sure people are going to feel great benefit from that personalisation and not have concerns about it.
CC: That’s a really good point. I guess you need wide adoption to be able to level the system in terms of bias.
SB: To test for bias, it’s important that you look both at aggregates and specific examples. A lot of it is about going deep, and understanding lived experiences, and what’s working — or not.
But we also want to look at the numbers. It might be that I happen to be a woman and I’m having a great experience using it but, on average, women are having a worse experience than men. So we look both at the specifics and also the generalities when we look at something like bias.
But, certainly, the more people that use the system, the more we learn from their experiences. Also, part of getting that technology out into people’s hands early is to help us get that learning going so we can really make sure the system is mature and it’s behaving the way people want every time.
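A minimal sketch of the ‘aggregates and specifics’ idea Bird outlines might look like the following: compute average satisfaction per demographic group, then pull out the lowest-rated individual examples for qualitative review. The data fields and numbers are invented for illustration.

```python
# Toy sketch of bias testing at two levels: group averages (aggregates) and
# individual low-rated examples (specifics). The feedback records are invented.

from collections import defaultdict
from statistics import mean

feedback = [
    {"group": "women", "satisfaction": 0.62, "comment": "kept misreading my request"},
    {"group": "men",   "satisfaction": 0.81, "comment": "worked well"},
    {"group": "women", "satisfaction": 0.90, "comment": "great experience"},
    {"group": "men",   "satisfaction": 0.78, "comment": "mostly fine"},
]

# Aggregate view: does any group fare worse on average?
by_group = defaultdict(list)
for record in feedback:
    by_group[record["group"]].append(record["satisfaction"])
averages = {group: mean(scores) for group, scores in by_group.items()}
print("average satisfaction by group:", averages)

# Specific view: pull the lowest-rated examples to understand lived experience.
worst = sorted(feedback, key=lambda r: r["satisfaction"])[:2]
for record in worst:
    print("low-rated example:", record["group"], "-", record["comment"])
```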
CC: Would you say that bias is still a primary concern for you with AI?
SB: Bias, or fairness, is one of our responsible AI principles. So, bias is always going to be something that we need to think about with any AI system. However, it manifests in different ways.
When we were looking at the previous wave of AI technologies, like speech-to-text or facial recognition, we were really focused on what we call quality-of-service fairness. When we look at generative AI, it’s a different type of fairness. It’s how people are represented in the AI system. Is a system representing people in a way that is disparaging, demeaning, stereotyping? Are they over-represented? Are they under-represented? Are they erased?
So we build out different approaches for testing based on the type of fairness we’re looking for. But fairness is going to be something we care about in every AI system.
CC: Hallucinations are a risk that we’ve known for a while now with gen AI. How far have we come since its emergence to improve the level of hallucinations that we see in these models?
SB: We’ve come a long way. When we first started looking at this, we didn’t even know what a hallucination was really like, or what should be considered a hallucination. We decided that a hallucination, in most applications, is where the response does not line up with the input data.
That was a really intentional decision: we said an important way to address the risk of hallucinations is making sure that you’re giving fresh, accurate, highly authoritative data to the system to respond with. Then, the second part is making sure that it then uses that data effectively. We’ve innovated a lot in techniques to help the model stay focused on the data we give it and to ensure it’s responding based on that.
We released new capabilities just this last month that I’m really excited about, which detect when there’s a mismatch between the data and the model’s response and correct it in real time — so we get a correct answer instead of something with a mistake in it.
That’s something that’s only been really possible in practice quite recently. We’ll keep pushing the boundary so that we can get lower and lower rates of mistakes in our AI systems.
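The groundedness checks Bird describes are trained detectors; as a purely illustrative stand-in, the sketch below flags response sentences whose content words barely overlap with the source data. The heuristic and the threshold are assumptions chosen to show the idea of comparing a response against its grounding, not how the real system works.

```python
# Crude groundedness heuristic: flag response sentences with little word overlap
# with the source data. Real systems use trained detectors; this is a sketch only.

import re


def content_words(text: str) -> set[str]:
    """Lowercase word set with a few common stopwords removed."""
    stop = {"the", "a", "an", "is", "are", "was", "were", "of", "in", "to", "and"}
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop}


def ungrounded_sentences(response: str, source: str, min_overlap: float = 0.5) -> list[str]:
    """Return response sentences with too little word overlap with the source."""
    source_words = content_words(source)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = content_words(sentence)
        if not words:
            continue
        overlap = len(words & source_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)  # candidate hallucination: unsupported by the source
    return flagged
```

A real-time correction step could then regenerate or repair only the flagged sentences, rather than rejecting the whole response.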
CC: Do you think you’ll be able to eradicate the hallucination issue?
SB: I think [by] having the model respond based on data it’s given, we can get that to [a level] that’s extremely low. If you’re saying do we want an AI model to never fabricate anything, I think that would be a mistake. Because another thing that’s great about the technology is its ability to help you imagine things, to write a creative poem, to write a fictional story.
You don’t necessarily want all of that to be grounded in facts. You want to make something new. We still want the AI system to be able to do that; it just may not be appropriate in every application.
CC: [Microsoft chief executive] Satya Nadella said that, as AI becomes more authentic, models are going to become more of a commodity. Is that something that you agree with?
SB: I think eventually the rate of innovation slows down and then you end up with many more models on the frontier. We’ve seen open-source models, for example, move very quickly behind the state-of-the-art models — making that capability available to everyone for their own hosting and their own use. I think we’re going to see that happen.
We very much [believe] the model is not the goal; the application is the goal. Regardless of what model you’re using, there’s still a lot you need to do to get to a complete AI application. You need to build safety and you need to test it. You need to be able to monitor it. You need to be able to audit it and provide information to your regulators.
CC: You had to deal with the fallout from the Taylor Swift deepfake. Can you talk me through how you tracked this and how you stopped it?
SB: If we’re looking at deepfakes and adversarial use of our systems, we’re constantly looking at the conversations that are happening out in the wild — and [asking] if what we’re seeing in the wild is possible in our system.
We test our systems to make sure that the defences we have in place are holding. We have layers that try to prevent particular outputs — for example, celebrity outputs or different types of sexual content. We’re constantly updating those based on attacks we see and how people are trying to get through.
But another really important investment for us in this space is what we call content credentials. We want to make sure that people understand the source of content. Our content credentials actually watermark and sign whether or not an AI image has been generated by Microsoft. We use that to identify if some of the images we’re seeing in the wild actually came from our system.
That can help people understand that it’s an AI image and not a real image. But it also helps us identify if there are still gaps. If we’re seeing images come out in the wild that aren’t something that would come out of our system, we analyse those and use them to help update our defences.
Deepfakes are a hard problem, and so we look holistically at every way we can help address this. But there’s definitely no silver bullet in this space.
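For readers unfamiliar with the provenance idea behind content credentials, here is a much-simplified sketch: bind a hash of the generated image to a signed record naming the generator, so the origin can later be verified. Real content credentials follow the C2PA standard with certificate-based signatures; the shared-key HMAC scheme and function names below are only an illustration.

```python
# Simplified provenance sketch: sign a record binding an image hash to its
# generator, then verify it later. Not the C2PA content credentials format.

import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-for-production"  # assumption: a shared secret for the sketch


def sign_image(image_bytes: bytes, generator: str) -> dict:
    """Produce a provenance record binding the image hash to its generator."""
    record = {
        "sha256": hashlib.sha256(image_bytes).hexdigest(),
        "generator": generator,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return record


def verify_image(image_bytes: bytes, record: dict) -> bool:
    """Check that the image matches the record and the record was signed by us."""
    claimed = {k: v for k, v in record.items() if k != "signature"}
    payload = json.dumps(claimed, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(expected, record["signature"])
        and claimed["sha256"] == hashlib.sha256(image_bytes).hexdigest()
    )
```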