We present Kosmos-2.5, a multimodal literate model for machine reading o...
We explore how continued pre-training on domain-specific corpora influen...
The increasing volume of log data produced by software-intensive systems...
Large language models (LLMs) have recently garnered significant interest...
In this work, we propose Retentive Network (RetNet) as a foundation arch...
Scaling sequence length has become a critical demand in the era of large...
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enablin...
Music representation learning is notoriously difficult for its complex h...
Recent studies have shown that dual encoder models trained with the sent...
ELECTRA, the generator-discriminator pre-training framework, has achieve...
Modern systems produce a large volume of logs to record run-time status ...
Large Language Models (LLMs) are popular for their impressive abilities,...
A big convergence of language, multimodal perception, action, and world ...
Position modeling plays a critical role in Transformers. In this paper, ...
Pre-trained models have achieved remarkable success in natural language ...
Large Transformers have achieved state-of-the-art performance across man...
In this paper, we elaborate upon recipes for building multilingual repre...
Named entity recognition (NER) suffers from the scarcity of annotated tr...
A big convergence of model architectures across language, vision, speech...
Sparse Mixture of Experts (MoE) has received great interest due to its...
Foundation models have received much attention due to their effectivenes...
The sparse Mixture-of-Experts (MoE) model is powerful for large-scale pr...
As more and more pre-trained language models adopt on-cloud deployment, ...
Sparse mixture of experts provides larger model capacity while requiring...
In this paper, we propose a simple yet effective method to stabilize ext...
Knowledge-Enhanced Models have developed a diverse set of techniques for ...
The poor performance of the original BERT for sentence semantic similari...
This report describes Microsoft's machine translation systems for the WM...
While pre-trained language models have achieved great success on various...
Compared to monolingual models, cross-lingual models usually require a m...
In this paper, we introduce ELECTRA-style tasks to cross-lingual languag...
While pretrained encoders have achieved success in various natural langu...
Large pre-trained models have achieved great success in many natural lan...
Fine-tuning pre-trained cross-lingual language models can transfer task-...
The cross-lingual language models are typically pretrained with masked l...
We generalize deep self-attention distillation in MiniLM (Wang et al., 2...
Commonsense explanation generation aims to empower the machine's sense-m...
Despite the success of generative pre-trained language models on a serie...
Document layout analysis usually relies on computer vision models to und...
Pre-training techniques have been verified successfully in a variety of ...
We present TableBank, a new image-based table detection and recognition ...
In this paper, we introduce a novel natural language generation task, te...
In this paper, we study a novel task that learns to compose music from n...
Sentence scoring and sentence selection are two main steps in extractive...
An intuitive way for a human to write paraphrase sentences is to replace...
Open domain response generation has achieved remarkable progress in rece...