Apple Machine Learning Research
- FastVLM: Efficient Vision Encoding for Vision Language Models
- Disentangled Representation Learning with the Gromov-Monge Gap
- Scaling Laws for Native Multimodal Models
- Step-by-Step Diffusion: An Elementary Tutorial
- DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
- International Conference on Learning Representations (ICLR) 2025
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models
- CoMotion: Concurrent Multi-Person 3D Motion
- EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing
- TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization
- FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
- Understanding Aggregate Trends for Apple Intelligence Using Differential Privacy
- MM-Ego: Towards Building Egocentric Multimodal LLMs
- Language Models Know More Than They Show: Exploring Hallucinations From the Model's Viewpoint
- Do LLMs Know Internally When They Follow Instructions?
- Exploring Empty Spaces: Human-in-the-Loop Data Augmentation
- ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
- UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
- Fundamental Challenges in Evaluating Text2SQL Solutions and Detecting Their Limitations
- Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models
- Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
- An Efficient and Streaming Audio Visual Active Speaker Detection System
- When Does a Predictor Know Its Own Loss?
- Towards AI-Driven Sign Language Generation with Non-Manual Markers
- DR-MPC: Deep Residual Model Predictive Control for Real-World Social Navigation
- Towards Automatic Assessment of Self-Supervised Speech Models Using Rank
- Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector Based Pseudo-Labels
- M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference
- Does Spatial Cognition Emerge in Frontier Models?
- SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions
- Novel View Synthesis with Pixel-Space Diffusion Models
- dMel: Speech Tokenization Made Simple
- MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
- Evaluating Sample Utility for Data Selection by Mimicking Model Weights
- Grounding Multimodal Large Language Models in Actions
- Wearable Accelerometer Foundation Models for Health via Knowledge Distillation
- KV Prediction for Improved Time to First Token
- From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
- FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
- Transfer Learning in Scalable Graph Neural Network for Improved Physical Simulation
- ARMOR: Egocentric Perception for Humanoid Robot Collision Avoidance and Motion Planning
- Robust Autonomy Emerges from Self-Play
- Private Federated Learning In Real World Application – A Case Study
- Findings of the IWSLT 2024 Evaluation Campaign
- ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model
- Theory, Analysis, and Best Practices for Sigmoid Self-Attention
- Cut Your Losses in Large-Vocabulary Language Models
- Neural Information Processing Systems (NeurIPS) 2024
- Apple Machine Learning Research at NeurIPS 2024
- Private and Personalized Frequency Estimation in a Federated Setting
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
- Classifier-Free Guidance Is a Predictor-Corrector
- Learning Elastic Costs to Shape Monge Displacements
- GENOT: Entropic (Gromov) Wasserstein Flow Matching with Applications to Single-Cell Genomics
- Leveraging Periodicity for Robustness with Multi-modal Mood Pattern Models
- Strategic Linear Contextual Bandits
- Towards Time-Series Reasoning with LLMs
- Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
- Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?
- Instance-Optimal Private Density Estimation in the Wasserstein Distance
- Multimodal Autoregressive Pre-Training of Large Vision Encoders
- Memory-Retaining Finetuning via Distillation
- Faster Algorithms for User-Level Private Stochastic Convex Optimization
- Private Stochastic Convex Optimization with Heavy Tails: Near-Optimality from Simple Reductions
- Do LLMs Internally "Know" When They Follow Instructions?
- Do LLMs Estimate Uncertainty Well in Instruction-Following?
- Transformation-Invariant Learning and Theoretical Guarantees for OOD Generalization
- Private Online Learning via Lazy Algorithms
- Generalizable Error Modeling for Human Data Annotation: Evidence from an Industry-Scale Search Data Annotation Program
- Misty: UI Prototyping Through Interactive Conceptual Blending
- Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments
- Contextualization of ASR with LLM Using Phonetic Retrieval-Based Augmentation
- Speculative Streaming: Fast LLM Inference Without Auxiliary Models
- European Conference on Computer Vision (ECCV) 2024
- Automated Code Fix Suggestions for Accessibility Issues in Mobile Apps
- Retrieval-Augmented Correction of Named Entity Speech Recognition Errors
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
- UI-JEPA: Towards Active Perception of User Intent Through Onscreen User Activity
- Generating Gender Alternatives in Machine Translation
- Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
- KGLens: Towards Efficient and Effective Knowledge Probing of Large Language Models with Knowledge Graphs
- Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation
- LLM in a Flash: Efficient Large Language Model Inference with Limited Memory
- BISCUIT: Scaffolding LLM-Generated Code with Ephemeral UIs in Computational Notebooks
- ConvKGYarn: Spinning Configurable and Scalable Conversational Knowledge Graph QA Datasets with Large Language Models
- Model-Driven Heart Rate Estimation and Heart Murmur Detection Based on Phonocardiogram
- Tuning LLMs with Contrastive Alignment Instructions for Machine Translation in Unseen, Low-resource Languages
- Apple Intelligence Foundation Language Models
- DataComp-LM: In Search of the Next Generation of Training Sets for Language Models
- Pre-Trained Foundation Model Representations to Uncover Breathing Patterns in Speech
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
- Federated Learning With Differential Privacy for End-to-End Speech Recognition
- Instance Optimal Private Density Estimation in the Wasserstein Distance
- Samplable Anonymous Aggregation for Private Federated Data Analytics
- PINE: Efficient Norm-Bound Verification for Secret-Shared Vectors
- Projected Language Models: A Large Model Pre-Segmented Into Smaller Ones
- Improving GFlowNets for Text-to-Image Diffusion Alignment
- Towards Automated Accessibility Report Generation for Mobile Apps
- International Conference on Machine Learning (ICML) 2024
- On a Neural Implementation of Brenier's Polar Factorization