A lawsuit takes aim at the way AI is built

In late June, Microsoft released a new type of artificial intelligence technology that can generate its own computer code.

Called Copilot, the tool was designed to speed up the work of professional programmers. As they type on their laptops, it suggests ready-made blocks of computer code that they can instantly add to their own.
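As a hypothetical illustration of that workflow (the function name, the comment and the suggested body below are invented for this example, not taken from Copilot’s real output), a programmer types a comment and a function signature, and the tool offers to fill in the rest:

```python
import re

# The programmer types only the comment and the signature below...
def is_valid_email(address: str) -> bool:
    """Return True if the given string looks like an email address."""
    # ...and a Copilot-style assistant proposes a body such as this,
    # which the programmer can accept, edit, or reject.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[a-zA-Z]{2,}", address) is not None
```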

Many programmers liked the new tool or were at least intrigued by it. But Matthew Butterick, a programmer, designer, writer and lawyer from Los Angeles, was not one of them. This month, he and a group of other lawyers filed a lawsuit seeking class-action status against Microsoft and other high-profile companies that designed and deployed Copilot.

Like many cutting-edge AI technologies, Copilot developed its skills by analyzing large amounts of data. In this case, it relies on billions of lines of computer code posted on the Internet. Mr. Butterick, 52, equates the process with piracy, as the system does not acknowledge its debt to existing works. His lawsuit claims that Microsoft and its affiliates violated the legal rights of millions of programmers who spent years writing original code.

The case is believed to be the first legal attack on a design technique called “AI training,” a way to create artificial intelligence poised to reshape the tech industry. In recent years, many artists, writers, pundits and privacy activists have complained that companies are training their AI systems using data they don’t own.

The case has echoes in the tech industry’s last few decades. In the 1990s and 2000s, Microsoft fought the rise of open source software, seeing it as an existential threat to the future of the company’s business. As open source grew in importance, Microsoft embraced it and even acquired GitHub, a home for open source programmers and a place where they create and store their code.

Almost every new generation of technology, even online search engines, has faced similar legal challenges. Often, “there is no statute or case law that covers it,” said Bradley J. Hulbert, an intellectual property lawyer who specializes in this increasingly important area of law.

The case is part of a groundswell of concern over artificial intelligence. Artists, writers, composers and other creative types are increasingly concerned that companies and researchers are using their work to create new technologies without their consent and without compensation. Companies train a variety of systems this way, including art generators, speech recognition systems like Siri and Alexa, and even driverless cars.

Copilot is based on technology developed by OpenAI, an artificial intelligence lab in San Francisco backed by a billion dollars in funding from Microsoft. OpenAI is at the forefront of an increasingly widespread effort to train artificial intelligence technologies using digital data.

After Microsoft and GitHub released Copilot, GitHub’s CEO, Nat Friedman, tweeted that using existing code to train the system was a “fair use” of the material under copyright law, an argument often used by companies and researchers developing these systems. But no court case has yet tested this argument.

“The ambitions of Microsoft and OpenAI go beyond GitHub and Copilot,” Mr. Butterick said in an interview. “They want to train on any data anywhere, for free, without consent, forever.”

In 2020, OpenAI unveiled a system called GPT-3. The researchers trained the system using large amounts of digital text, including thousands of books, Wikipedia articles, chat logs and other data posted on the Internet.

By identifying patterns in all the text, this system learns to predict the next word in a sequence. When someone types a few words into this “large language model,” it can complete a thought with an entire paragraph of text. Thus, the system can write its own Twitter posts, speeches, poems and news articles.
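The idea can be sketched in deliberately toy form: count which word tends to follow which in a training text, then emit the most frequent successor. Real systems like GPT-3 use neural networks with billions of parameters rather than simple counts, so the Python below illustrates only the next-word-prediction principle:

```python
from collections import Counter, defaultdict

# A tiny "training corpus"; real models ingest billions of words.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word (a bigram model).
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Predict the most likely next word, given the previous one."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # -> "cat", the most frequent successor of "the"
```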

To the surprise of the researchers who developed the system, it can also write computer programs, apparently having learned from numerous programs posted on the Internet.

So OpenAI went a step further, training a new system, Codex, on a new collection of data specifically stocked with code. At least some of this code, the lab later said in a research paper detailing the technology, came from GitHub, a popular programming service owned and operated by Microsoft.

This new system became the underlying technology for Copilot, which Microsoft distributed to programmers via GitHub. After about a year of testing with a relatively small number of programmers, Copilot was brought to all coders on GitHub in July.

For now, the code that Copilot generates is simple and can be useful to a larger project, but it must be edited, expanded and verified, say many programmers who have used the technology. Some programmers find it useful when learning to code or trying to master a new language.

Still, Mr. Butterick worries that Copilot will destroy the global community of programmers who have created the code at the heart of much of modern technology. A few days after the system was released, he published a blog post titled: “This copilot is stupid and wants to kill me.”

Mr. Butterick identifies as an open source programmer, part of a community of programmers who openly share their code with the world. Over the past 30 years, open source software has helped shape much of the technology consumers use every day, including web browsers, smartphones, and mobile apps.

Although open source software is designed to be shared freely between coders and companies, this sharing is governed by licenses designed to ensure that it is used in a way that benefits the larger community of programmers. Mr. Butterick believes that Copilot violates these licenses and will make open source coders obsolete as it continues to evolve.
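For example, permissive licenses such as the MIT license allow broad reuse but require that the original copyright and permission notice stay attached to the code. The header below is a generic sketch of that kind of notice; the name and year are placeholders, not taken from any real project:

```python
# Copyright (c) 2015 Jane Example
#
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software... (MIT License)
#
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
```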

After complaining publicly about the matter for several months, he filed his case with a handful of other lawyers. The case is still in its early stages and the court has not yet given it class-action status.

To the surprise of many legal experts, Mr. Butterick’s lawsuit does not accuse Microsoft, GitHub and OpenAI of copyright infringement. His lawsuit takes a different tack, arguing that the companies violated GitHub’s terms of service and privacy policies and ran afoul of a federal law that requires companies to display copyright information when they use material.

Mr. Butterick and another lawyer behind the suit, Joe Saveri, said the case could ultimately deal with copyright issues.

Asked if the company could discuss the lawsuit, a GitHub spokeswoman declined, saying only in an emailed statement that the company “has been committed to responsibly innovating with Copilot since the beginning, and will continue to evolve the product to best serve developers around the world.” Microsoft and OpenAI declined to comment on the lawsuit.

Under existing law, most experts believe, training an AI system on copyrighted material is not necessarily illegal. But doing so could be if the system ends up generating material that is substantially similar to the data it was trained on.

Some users of Copilot have said it produces code that appears identical, or nearly identical, to existing programs, an observation that could become central to Mr. Butterick’s case and others.
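As a purely hypothetical illustration of what those users describe (both the function and the repository are invented for this example; this is not a real Copilot suggestion), the concern is a completion that mirrors licensed code while dropping the notice its license requires:

```python
# Published in a hypothetical open source repository under the MIT
# License, with the required copyright notice attached:
#   Copyright (c) 2018 Alice Example
def clamp(value, low, high):
    """Constrain value to the inclusive range [low, high]."""
    return max(low, min(value, high))

# The reported pattern: a suggested completion that is, line for line,
# the same function, but with the copyright notice stripped away.
def clamp_suggested(value, low, high):
    """Constrain value to the inclusive range [low, high]."""
    return max(low, min(value, high))
```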

Pam Samuelson, a professor at the University of California, Berkeley, who specializes in intellectual property and its role in modern technology, said legal thinkers and regulators briefly explored these legal issues in the 1980s, before the technology existed. Now, she said, a legal assessment is needed.

“This is no longer a toy problem,” Dr. Samuelson said.