Apple Machine Learning Research

FastVLM: Efficient Vision Encoding for Vision Language Models
Disentangled Representational Learning with the Gromov-Monge Gap
Scaling Laws for Native Multimodal Models
Step-by-Step Diffusion: An Elementary Tutorial
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
International Conference on Learning Representations (ICLR) 2025
Scaling Diffusion Language Models via Adaptation from Autoregressive Models
CoMotion: Concurrent Multi-Person 3D Motion
EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing
TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Understanding Aggregate Trends for Apple Intelligence Using Differential Privacy
MM-Ego: Towards Building Egocentric Multimodal LLMs
Language Models Know More Than They Show: Exploring Hallucinations From the Model's Viewpoint
Do LLMs Know Internally When They Follow Instructions?
Exploring Empty Spaces: Human-in-the-Loop Data Augmentation
ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
Fundamental Challenges in Evaluating Text2SQL Solutions and Detecting Their Limitations
Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models
Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
An Efficient and Streaming Audio Visual Active Speaker Detection System
When Does a Predictor Know Its Own Loss?
Towards AI-Driven Sign Language Generation with Non-Manual Markers
DR-MPC: Deep Residual Model Predictive Control for Real-World Social Navigation
Towards Automatic Assessment of Self-Supervised Speech Models Using Rank
Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector Based Pseudo-Labels
M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference
Does Spatial Cognition Emerge in Frontier Models?
SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions
Novel View Synthesis with Pixel-Space Diffusion Models
dMel: Speech Tokenization Made Simple
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Evaluating Sample Utility for Data Selection by Mimicking Model Weights
Grounding Multimodal Large Language Models in Actions
Wearable Accelerometer Foundation Models for Health via Knowledge Distillation
KV Prediction for Improved Time to First Token
From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
Transfer Learning in Scalable Graph Neural Network for Improved Physical Simulation
ARMOR: Egocentric Perception for Humanoid Robot Collision Avoidance and Motion Planning
Robust Autonomy Emerges from Self-Play
Private Federated Learning In Real World Application – A Case Study
Findings of the IWSLT 2024 Evaluation Campaign
ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model
Theory, Analysis, and Best Practices for Sigmoid Self-Attention
Cut Your Losses in Large-Vocabulary Language Models
Neural Information Processing Systems (NeurIPS) 2024
Apple Machine Learning Research at NeurIPS 2024
Private and Personalized Frequency Estimation in a Federated Setting
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
Classifier-Free Guidance Is a Predictor-Corrector
Learning Elastic Costs to Shape Monge Displacements
GENOT: Entropic (Gromov) Wasserstein Flow Matching with Applications to Single-Cell Genomics
Leveraging Periodicity for Robustness with Multi-modal Mood Pattern Models
Strategic Linear Contextual Bandits
Towards Time-Series Reasoning with LLMs
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?
Instance-Optimal Private Density Estimation in the Wasserstein Distance
Multimodal Autoregressive Pre-Training of Large Vision Encoders
Memory-Retaining Finetuning via Distillation
Faster Algorithms for User-Level Private Stochastic Convex Optimization
Private Stochastic Convex Optimization with Heavy Tails: Near-Optimality from Simple Reductions
Do LLMs Internally "Know" When They Follow Instructions?
Do LLMs Estimate Uncertainty Well in Instruction-Following?
Transformation-Invariant Learning and Theoretical Guarantees for OOD Generalization
Private Online Learning via Lazy Algorithms
Generalizable Error Modeling for Human Data Annotation: Evidence from an Industry-Scale Search Data Annotation Program
Misty: UI Prototyping Through Interactive Conceptual Blending
Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments
Contextualization of ASR with LLM Using Phonetic Retrieval-Based Augmentation
Speculative Streaming: Fast LLM Inference Without Auxiliary Models
European Conference on Computer Vision (ECCV) 2024
Automated Code Fix Suggestions for Accessibility Issues in Mobile Apps
Retrieval-Augmented Correction of Named Entity Speech Recognition Errors
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
UI-JEPA: Towards Active Perception of User Intent Through Onscreen User Activity
Generating Gender Alternatives in Machine Translation
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
KGLens: Towards Efficient and Effective Knowledge Probing of Large Language Models with Knowledge Graphs
Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation
LLM in a Flash: Efficient Large Language Model Inference with Limited Memory
BISCUIT: Scaffolding LLM-Generated Code with Ephemeral UIs in Computational Notebooks
ConvKGYarn: Spinning Configurable and Scalable Conversational Knowledge Graph QA Datasets with Large Language Models
Model-Driven Heart Rate Estimation and Heart Murmur Detection Based on Phonocardiogram
Tuning LLMs with Contrastive Alignment Instructions for Machine Translation in Unseen, Low-resource Languages
Apple Intelligence Foundation Language Models
DataComp-LM: In Search of the Next Generation of Training Sets for Language Models
Pre-Trained Foundation Model Representations to Uncover Breathing Patterns in Speech
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
Federated Learning With Differential Privacy for End-to-End Speech Recognition
Instance Optimal Private Density Estimation in the Wasserstein Distance
Samplable Anonymous Aggregation for Private Federated Data Analytics
PINE: Efficient Norm-Bound Verification for Secret-Shared Vectors
Projected Language Models: A Large Model Pre-Segmented Into Smaller Ones
Improving GFlowNets for Text-to-Image Diffusion Alignment
Towards Automated Accessibility Report Generation for Mobile Apps
International Conference on Machine Learning (ICML) 2024
On a Neural Implementation of Brenier's Polar Factorization