For the past 500 years, the National Library of Sweden has collected nearly every word published in Swedish, from priceless medieval manuscripts to present-day pizza menus.
Thanks to a centuries-old law requiring a copy of everything published in Swedish to be submitted to the library (known as Kungliga biblioteket, or KB), its collections span from the obvious to the obscure: books, newspapers, radio and TV broadcasts, web content, Ph.D. dissertations, postcards, menus and video games. It's a wildly diverse collection of nearly 26 petabytes of data, ideal for training state-of-the-art AI.
“We can build state-of-the-art AI models for the Swedish language because we have the best data,” said Love Börjeson, director of KBLab, the library's data lab.
Using NVIDIA DGX systems, the group has developed more than two dozen open-source transformer models, available on Hugging Face. The models, downloaded by up to 200,000 developers per month, enable research at the library and other academic institutions.
“Before our lab was created, researchers couldn't access a dataset at the library; they'd have to look at a single object at a time,” Börjeson said. “There was a need for the library to create datasets that enabled researchers to conduct quantity-oriented research.”
With this, researchers will soon be able to create hyper-specialized datasets: for example, pulling up every Swedish postcard that depicts a church, every text written in a particular style or every mention of a historical figure across books, newspaper articles and TV broadcasts.
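Such a query could be sketched as a filter over catalog metadata. The record schema and field names below are invented for illustration; they are not KB's actual catalog structure or API:

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    """One digitized object with catalog metadata (hypothetical schema)."""
    object_type: str          # e.g. "postcard", "book", "newspaper", "tv"
    text: str                 # OCR'd or transcribed content
    tags: list = field(default_factory=list)

def query(records, object_type=None, keyword=None):
    """Filter the collection down to a hyper-specialized dataset."""
    results = []
    for r in records:
        if object_type and r.object_type != object_type:
            continue
        if keyword and keyword.lower() not in r.text.lower():
            continue
        results.append(r)
    return results

records = [
    Record("postcard", "Greetings from the church at Lund", ["church"]),
    Record("postcard", "Stockholm harbour at dusk", ["harbour"]),
    Record("book", "A history of Swedish churches", ["church"]),
]

# Every postcard that mentions a church:
church_postcards = query(records, object_type="postcard", keyword="church")
print(len(church_postcards))  # 1
```

At petabyte scale the real implementation would of course run against indexed storage rather than an in-memory list, but the shape of the query is the same.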
Turning Library Archives Into AI Training Data
The library's datasets represent the full diversity of the Swedish language, including its formal and informal variants, regional dialects and changes over time.
“Our inflow is continuous and growing: every month, we see more than 50 terabytes of new data,” said Börjeson. “Between the exponential growth of digital data and ongoing work digitizing physical collections that date back hundreds of years, we'll never be done adding to our collections.”

Soon after KBLab was established in 2019, Börjeson saw the potential for training transformer language models on the library's vast archives. He was inspired by an early multilingual natural language processing model from Google that included 5GB of Swedish text.
KBLab's first model used 4x as much, and the team now aims to train its models on at least a terabyte of Swedish text. The lab began experimenting with adding Dutch, German and Norwegian content to its datasets after finding that a multilingual dataset could improve the AI's performance.
NVIDIA AI, GPUs Accelerate Model Development
The lab started out using consumer-grade NVIDIA GPUs, but Börjeson soon discovered that his team needed data-center-scale compute to train larger models.

The lab has two NVIDIA DGX systems from Swedish provider AddPro for on-premises AI development. The systems are used to handle sensitive data, conduct large-scale experiments and fine-tune models. They're also used to prepare for even larger runs on massive, GPU-based supercomputers across the European Union, including the MeluXina system in Luxembourg.
“Our work on the DGX systems is critically important, because once we're in a high-performance computing environment, we want to hit the ground running,” said Börjeson. “We have to use the supercomputer to its fullest extent.”
The team has also adopted NVIDIA NeMo Megatron, a PyTorch-based framework for training large language models, with NVIDIA CUDA and the NVIDIA NCCL library under the hood to optimize GPU utilization in multi-node systems.
“We rely to a large extent on the NVIDIA frameworks,” Börjeson said. “It's one of the big advantages of NVIDIA for us, as a small lab that doesn't have 50 engineers available to optimize AI training for every project.”
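In NeMo Megatron, multi-node scaling of the kind described here is largely driven by configuration rather than custom engineering. The fragment below is an illustrative sketch in the style of NeMo's Hydra-based YAML configs; the exact keys and defaults vary by NeMo version, and the values shown are placeholders, not KBLab's settings:

```yaml
# Illustrative NeMo Megatron-style training config (values are placeholders)
trainer:
  num_nodes: 2              # scale across DGX nodes; NCCL handles communication
  devices: 8                # GPUs per node
  precision: bf16           # mixed precision for throughput

model:
  micro_batch_size: 4
  global_batch_size: 256
  tensor_model_parallel_size: 2    # split layers across GPUs within a node
  pipeline_model_parallel_size: 1  # split the layer stack across nodes if needed
  num_layers: 24
  hidden_size: 1024
  num_attention_heads: 16
```

Tuning the parallelism sizes on the in-house DGX systems before moving to a supercomputer like MeluXina is exactly the kind of "hit the ground running" preparation Börjeson describes.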
Harnessing Multimodal Data for Humanities Research
In addition to transformer models that understand Swedish text, KBLab has an AI tool that transcribes sound to text, enabling the library to transcribe its vast collection of radio broadcasts so that researchers can search the audio files for specific content.
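Once broadcasts are transcribed with timestamps, searching the audio reduces to searching text. A minimal sketch of that idea, with a hypothetical segment structure (this is not KBLab's actual pipeline):

```python
# Each transcribed broadcast is a list of (start_seconds, text) segments;
# searching returns the timestamps where a term occurs, so a researcher
# can jump straight to that point in the audio. (Hypothetical structure.)

def find_mentions(transcript, term):
    term = term.lower()
    return [start for start, text in transcript if term in text.lower()]

broadcast = [
    (0.0, "Welcome to the morning news"),
    (12.5, "Today the Royal Library announced a new dataset"),
    (47.2, "In other news, the library digitized more radio archives"),
]

print(find_mentions(broadcast, "library"))  # [12.5, 47.2]
```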

KBLab is also starting to develop generative text models and is working on an AI model that could process videos and create automatic descriptions of their content.
“We also want to link all the different modalities,” Börjeson said. “When you search the library's databases for a specific term, we should be able to return results that include text, audio and video.”
KBLab has partnered with researchers at the University of Gothenburg, who are creating downstream apps using the lab's models to conduct linguistic research, including a project supporting the Swedish Academy's work to modernize its data-driven methods for creating Swedish dictionaries.
“The societal benefits of these models are much larger than we initially expected,” Börjeson said.
Images courtesy of Kungliga biblioteket