We investigate embedding models on new "needle-in-haystack" tasks and find that beyond 4K tokens, they're just rolling dice - even with exact lexical matches or query expansion, they can't tell signal from noise in long context.