Technology

How AI can make history

A scholar of 18th-century history was overwhelmed by piles of letters, journals, and legal documents. He tried using AI on a whim — and found it surprisingly useful.

Published by Web Desk

Like millions of other people, the first thing Mark Humphries did with ChatGPT when it was released in late 2022 was ask it to perform parlor tricks, like writing poetry in the style of Bob Dylan. The results, while very impressive, did not seem particularly useful to him, a historian studying the 18th-century fur trade. But Humphries, a 43-year-old professor at Wilfrid Laurier University in Waterloo, Canada, had long been interested in applying artificial intelligence to his work. He was already using a specialized text recognition tool designed to transcribe antiquated scripts and typefaces, though it made frequent errors that took time to correct. Curious, he pasted the tool’s garbled interpretation of a handwritten French letter into ChatGPT. The AI corrected the text, fixing all the Fs that had been misread as Ss and even adding missing accents. Then Humphries asked ChatGPT to translate it into English. It did that, too. Maybe, he thought, this thing would be useful after all.

For Humphries, AI tools held a tantalizing promise. Over the last decade, millions of documents in archives and libraries have been scanned and digitized — Humphries was involved in one such effort himself — but because their wide variety of formats, fonts, and vocabulary rendered them impenetrable to automated search, working with them required stupendous amounts of manual research. For a previous project, Humphries pieced together biographies for several hundred shellshocked World War I soldiers from assorted medical records, war diaries, newspapers, personnel files, and other ephemera. It had taken years and a team of research assistants to read, tag, and cross-reference the material for each individual. If new language models were as powerful as they seemed, he thought, it might be possible to simply upload all this material and ask the model to extract all the documents related to every soldier diagnosed with shell shock.

“That’s a lifetime’s work right there, or at least a decade,” said Humphries. “And you can imagine scaling that up. You could get an AI to figure out if a soldier was wounded on X date, what was happening with that unit on X date, and then access information about the members of that unit, that as historians, you’d never have the time to chase down on an individual basis,” he said. “It might open up new ways of understanding the past.” 

Improved database management may be a far cry from the world-conquering superintelligence some predict, but it’s characteristic of the way language models are filtering into the real world. From law to programming to journalism, professionals are trying to figure out whether and how to integrate this promising, risky, and very weird technology into their work. For historians, a technology capable of synthesizing entire archives that also has a penchant for fabricating facts is as appealing as it is terrifying, and the field, like so many others, is just beginning to grapple with the implications of such a potentially powerful but slippery tool.

AI seemed to be everywhere at the 137th annual meeting of the American Historical Association last month, according to Cindy Ermus, an associate professor of history at the University of Texas at San Antonio. She chaired one of several panels on the topic. Ermus described her and many of her colleagues’ relationship to AI as that of “curious children,” wondering with both excitement and wariness what aspects of their work it will change and how. “It’s going to transform every part of historical research, from collection, to curation, to writing, and of course, teaching,” she said. She was particularly impressed by Lancaster University lecturer Katherine McDonough’s presentation of a machine learning program capable of searching historic maps, initially trained on ordnance surveys of 19th-century Britain. 

“She searched the word ‘restaurant,’ and it pulled up the word ‘restaurant’ in tons of historical maps through the years,” Ermus said. “To the non-historian, that might not sound like a big deal, but we’ve never been able to do that before, and now it’s at our fingertips.” 

Another attendee, Lauren Tilton, professor of liberal arts and digital humanities at the University of Richmond, had been working with machine learning for over a decade and recently worked with the Library of Congress to apply computer vision to the institution’s vast troves of minimally labeled photos and films. All archives are biased — in what material is saved to begin with and in how it is organized. The promise of AI, she said, is that it can open up archives at scale and make them searchable for things the archivists of the past didn’t value enough to label. 

“The most described materials in the archive are usually the sort of voices we’ve heard before — the famous politicians, famous authors,” she said. “But we know that there are many stories by people of minoritized communities, communities of color, LGBTQ communities that have been hard to tell, not because people haven’t wanted to, but because of the challenges of how to search the archive.”

AI systems have their own biases, however. They have a well-documented tendency to reflect the gender, racial, and other biases of their training data: as Ermus pointed out, when she asked GPT-4 to create an image of a history professor, it drew an elderly white man with elbow patches on his blazer. But they also display a bias that Tilton calls “presentism.” Because the vast preponderance of training data is scraped from the contemporary internet, models reflect a contemporary worldview. Tilton encountered this phenomenon when she found that image recognition systems struggled to make sense of older photos, for example, labeling typewriters as computers and their paperweights as mice. These were image recognition systems, but language models have a similar problem.

Impressed with ChatGPT, Humphries signed up for the OpenAI API and set out to make an AI research assistant. He was trying to track 18th-century fur traders through a morass of letters, journals, marriage certificates, legal documents, parish records, and contracts in which they appear only fleetingly. His goal was to design a system that could automate the process.

One of the first challenges he encountered was that 18th-century fur traders do not sound anything like a language model assumes. Ask GPT-4 to write a sample entry, as I did, and it will produce lengthy reflections on the sublime loneliness of the wilderness, saying things like, “This morn, the skies did open with a persistent drizzle, cloaking the forest in a veil of mist and melancholy,” and “Bruno, who had faced every hardship with the stoicism of a seasoned woodsman, now lay still beneath the shelter of our makeshift tent, a silent testament to the fragility of life in these untamed lands.”

Whereas an actual fur trader would be far more concise. For example, “Fine Weather. This morning the young man that died Yesterday was buried and his Grave was surrounded with Pickets. 9 Men went to gather Gum of which they brought wherewith to Gum 3 Canoes, the others were employed as yesterday,” as one wrote in 1806, referring to gathering tree sap to seal the seams of their bark canoes. 

“The problem is that the language model wouldn’t pick up on a record like that, because it doesn’t contain the type of reflective writing that it’s trained to see as being representative of an event like that,” said Humphries. Trained on contemporary blog posts and essays, it would expect the death of a companion to be followed by lengthy emotional remembrances, not an inventory of sap supplies.

By fine-tuning the model on hundreds of examples of fur trader prose, Humphries got it to pull out journal entries in response to questions, but not always relevant ones. The antiquated vocabulary still posed a problem — words like varangue, a French term for the rib of a canoe that would rarely appear in the model’s training data, if ever. 

After much trial and error, he ended up with an AI assembly line using multiple models to sort documents, search them for keywords and meaning, and synthesize answers to queries. It took a lot of time and a lot of tinkering, but GPT helped teach him the Python he needed. He named the system HistoryPearl, after his smartest cat. 
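
As a rough illustration only, a three-stage pipeline of this kind (sort, search, synthesize) might be sketched in Python as below. This is not Humphries’s actual code: the article does not describe HistoryPearl’s internals, so every name here (`Document`, `sort_documents`, `search_documents`, `synthesize_answer`, and the keyword cues) is a hypothetical stand-in, and plain keyword matching takes the place of the language models the real system used.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical record for one transcribed archival item."""
    doc_id: str
    text: str


def sort_documents(docs, type_keywords):
    """Stage 1: bucket documents by type using assumed keyword cues."""
    buckets = {}
    for doc in docs:
        lowered = doc.text.lower()
        label = "other"
        for doc_type, cues in type_keywords.items():
            if any(cue in lowered for cue in cues):
                label = doc_type
                break
        buckets.setdefault(label, []).append(doc)
    return buckets


def search_documents(docs, query_terms):
    """Stage 2: keep documents mentioning any query term, most matches first."""
    scored = []
    for doc in docs:
        lowered = doc.text.lower()
        score = sum(lowered.count(term.lower()) for term in query_terms)
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


def synthesize_answer(question, hits):
    """Stage 3: placeholder for a model call; it only lists the evidence found."""
    if not hits:
        return f"No entries found for: {question}"
    cited = "; ".join(f"[{doc.doc_id}] {doc.text}" for doc in hits)
    return f"{question} -> {cited}"


if __name__ == "__main__":
    # Sample texts are the journal excerpts quoted in this article.
    docs = [
        Document(
            "journal-1806",
            "Fine Weather. This morning the young man that died Yesterday "
            "was buried and his Grave was surrounded with Pickets.",
        ),
        Document("wentzel", "F. W.'s Girl was safely delivered of a boy."),
    ]
    buckets = sort_documents(docs, {"journal": ["weather"]})
    hits = search_documents(buckets.get("journal", []), ["buried", "grave"])
    print(synthesize_answer("Who was buried?", hits))
```

In a system like the one described, each stage would presumably call a model rather than match strings, which is what would let it find the terse 1806 burial entry from a question that shares none of its vocabulary.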

He tested his system against edge cases, like the Norwegian trader Ferdinand Wentzel, who wrote about himself in the third person and deployed an odd sense of humor, for example, writing about the birth of his son by speculating about his paternity and making self-deprecating jokes about his own height — “F. W.’s Girl was safely delivered of a boy. - I almost believe it is his Son for his features seem to bear some resemblance of him & his short legs seem to determine this opinion beyond doubt.” This sort of writing stymied earlier models, but HistoryPearl could pull it up in response to a vaguely phrased question about Wentzel’s humor, along with other examples of Wentzel’s wit Humphries hadn’t been looking for. 

The tool still missed some things, but it performed better than the average graduate student Humphries would normally hire to do this sort of work. And faster. And much, much cheaper. Last November, after OpenAI dropped prices for API calls, he did some rough math. What he would pay a grad student around $16,000 to do over the course of an entire summer, GPT-4 could do for about $70 in around an hour. 

“That was the moment where I realized, ‘Okay, this begins to change everything,’” he said. As a researcher, it was exciting. As a teacher, it was frightening. Organizing fur trading records may be a niche application, but a huge number of white collar jobs consist of similar information management tasks. His students were supposed to be learning the sorts of research and thinking skills that would allow them to be successful in just these sorts of jobs. In November, he published a newsletter imploring his peers in academia to take the rapid development of AI seriously. “AI is simply starting to outrun many people’s imaginations,” he wrote. “They’re still talking about the technology as if it is a theoretical thing without the full understanding that it poses a very real, existential threat to our whole raison d’être as higher educators.”

In the meantime, though, he was pleased that his tinkering had resulted in what he calls a “proof of concept”: reliable enough to be potentially useful, though not yet reliable enough to fully trust. Humphries and his research partner, the historian Lianne Leddy, submitted a grant application to scale their research up to all 30,000 voyageurs in their database. In a way, he found the labor required to develop this labor-saving system comforting. The largest improvements in the model came from feeding it the right data, something he was able to do only because of his expertise in the material. Lately, he has been thinking that there may actually be more demand for domain experts with the sort of research and critical assessment skills the humanities teach. This year he will teach an applied generative AI program he designed, run out of the Faculty of Arts.

“In some ways this is old wine in new bottles, right?” he said. In the mid 20th century, he pointed out, companies had vast corporate archives staffed by researchers who were experts, not just in storing and organizing documents, but in the material itself. “In order to make a lot of this data useful, people are needed who have both the ability to figure out how to train models, but more importantly, who understand what is good content and what’s not. I think that’s reassuring,” he said. “Whether I’m just deluding myself, that’s another question.”

Technology

Cox Communications won’t have to pay $1 billion to record labels after all

But it will have to eventually pay something.

Published by Web Desk

In the seemingly endless fight between record labels and ISPs over music piracy, the Fourth Circuit Court of Appeals in Richmond, Virginia, decided Tuesday that $1 billion is too much for Cox Communications to pay record labels in damages. Instead, as reported by Reuters, a new trial should be held in a federal district court to determine an appropriate amount.

This new ruling overturns a 2019 US district court jury’s decision siding with the record labels involved in the lawsuit, which include Sony Music, Universal Music Group, Warner Music Group, and EMI. The companies accused Cox of not addressing over 10,000 copyright infringement notices and failing to take action against music pirates, such as cutting off their broadband access. But the circuit court reversed the damages, noting that Cox “did not profit from its subscribers’ acts of infringement,” a legal prerequisite for part of the liability.

This is not the first time Cox Communications has tried to appeal that $1 billion judgment, but it is the first time it has been successful. Cox previously asked a federal court in Virginia to lower the damages or give it a new trial. When that court said no, the ISP filed a motion with a district court in Colorado claiming Sony fabricated evidence to obtain a favorable verdict.

The evidence in question was used in another music copyright infringement case against another ISP, Charter, and Cox sought to prove that evidence was created years after the music companies claimed it was illegally downloaded over Cox’s network. However, this allegation was not mentioned in the circuit court’s opinion Tuesday.

Neither music companies nor ISPs have been able to do much to stop repeat pirates; the two sides mutually decided to end their Copyright Alert System partnership (known as the “six strikes” rule) in 2017 after it failed to significantly reduce illegal music and video downloads. The system succeeded in deterring internet users who infrequently pirated copyrighted material, but it did nothing against those who pirated consistently.

Pakistan

Light rain-wind/thunderstorm likely at isolated places in upper KP, GB, Kashmir

Mainly cold and dry weather is expected in most parts of the country, while very cold in upper parts.

Published by Hussnain Bhutta

Islamabad: Light rain, wind and thunderstorms, with light snowfall over the hills, are expected at isolated places in upper Khyber Pakhtunkhwa, Gilgit-Baltistan, Kashmir and adjoining hilly areas, according to the Pakistan Meteorological Department (PMD) forecast.

Mainly cold and dry weather is expected in most parts of the country, while very cold in upper parts.

According to the synoptic situation, a shallow westerly wave was still present over Kashmir and adjoining areas.

During the last 24 hours, cold and dry weather prevailed over most parts of the country, while it was very cold in upper parts.

Rain-thunderstorm with snowfall over hills occurred at isolated places in upper Khyber Pakhtunkhwa, Kashmir and adjoining hilly areas.

Rainfall recorded during the period: Khyber Pakhtunkhwa: Kakul 8 mm; Punjab: Murree 4 mm; Kashmir: Rawalakot 1 mm.

The snowfall recorded was 0.5 inches in Murree.

The lowest temperatures recorded were Leh -14°C, Kalam -13°C, Astore -9°C, Skardu -6°C, Gupis -5°C, Malam Jabba, Bagrote and Hunza -4°C, Rawalakot, Chitral and Dir -3°C, and Murree and Gilgit -2°C.

 

 

 
