Google has unveiled Gemma 4 12B, a new unified multimodal model. Its most notable change is the complete removal of the separate "encoder" component that traditional multimodal models have relied on, a change said to bring a major improvement in local deployment and inference efficiency on consumer-grade hardware. Reportedly, Gemma 4 12B can process text, images, and audio on laptops with 16GB of RAM, and it also has some coding ability. It is well suited to on-device data analysis and information extraction, bringing agentic and multimodal capabilities to laptops.
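
As a rough sanity check on the 16GB figure, the weight footprint of a 12-billion-parameter model can be estimated at different numeric precisions. The sketch below is illustrative only: the parameter count is read off the "12B" in the model name, the precisions are assumptions (nothing here states which quantization is actually used), and the helper `weight_memory_gib` is a hypothetical name. It counts weights only and ignores activations, the KV cache, and memory used by the operating system.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Assumption: "12B" means roughly 12 billion parameters; the exact count may differ.
NUM_PARAMS = 12e9

# Assumption: common weight precisions, not confirmed for this model.
for bits in (16, 8, 4):
    gib = weight_memory_gib(NUM_PARAMS, bits)
    print(f"{bits}-bit weights: ~{gib:.1f} GiB")
```

Under these assumptions, 16-bit weights alone come to roughly 22 GiB, which would not fit in 16GB of RAM, while 8-bit weights (about 11 GiB) or 4-bit weights (under 6 GiB) would. Running on such a laptop therefore presumably relies on some form of reduced-precision weights.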