Meta Unveils Open Source Llama 3.2: AI That Sees And Fits in Your Pocket

AICoin Official · 2024-09-27T00:37:04.000Z

It's been a good week for open-source AI.

On Wednesday, Meta announced an upgrade to its state-of-the-art large language model, Llama 3.2, and it doesn't just talk: it sees.

More intriguing, some versions can squeeze into your smartphone without losing quality, which means you could potentially have private local AI interactions, apps, and customizations without sending your data to third-party servers.

Unveiled Wednesday during Meta Connect, Llama 3.2 comes in four flavors, each packing a different punch. The heavyweight contenders, the 11B and 90B parameter models, flex their muscles with both text and image processing capabilities.

They can tackle complex tasks such as analyzing charts, captioning images, and even pinpointing objects in pictures based on natural language descriptions.

Llama 3.2 arrived the same week as Allen Institute’s Molmo, which claimed to be the best open-source multimodal vision LLM in synthetic benchmarks, performing in our tests on par with GPT-4o, Claude 3.5 Sonnet, and Reka Core.

Zuck’s company also introduced two new flyweight champions: a pair of 1B and 3B parameter models designed for efficiency, speed, and limited but repetitive tasks that don’t require too much computation.

These small models are multilingual text maestros with a knack for “tool-calling,” meaning they can integrate better with programming tools. Despite their diminutive size, they boast an impressive 128K token context window (the same as GPT-4o and other powerful models), making them ideal for on-device summarization, instruction following, and rewriting tasks.

Meta's engineering team pulled off some serious digital gymnastics to make this happen. First, they used structured pruning to trim redundant parameters from larger models, then employed knowledge distillation, which transfers knowledge from large models to smaller ones, to squeeze in extra smarts.
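
To make the two techniques concrete, here is a toy sketch in plain Python. It is not Meta's code, which the article does not show: the function names (`prune_rows`, `distillation_loss`), the choice of L2 norm as the pruning criterion, and the temperature value are all assumptions made for illustration. Structured pruning is shown as dropping whole rows of a weight matrix, and distillation as the divergence between a teacher's and a student's softened output distributions.

```python
import math

def prune_rows(weights, keep):
    """Structured pruning (illustrative): keep the `keep` rows with the largest L2 norm.

    Each row stands in for a whole unit (e.g. a neuron); whole rows are removed,
    rather than individual weights being zeroed.
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [weights[i] for i in kept]

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical 3-unit layer: the middle row has a near-zero norm and is pruned.
layer = [[0.9, -1.2], [0.01, 0.02], [0.5, 0.4]]
print(prune_rows(layer, keep=2))

# The loss is zero when the student matches the teacher and positive otherwise.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))
print(distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]))
```

In a real training run the student would be updated to minimize this loss, usually alongside the ordinary next-token loss, so it learns from the teacher's full output distribution instead of only the single correct answer.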
The result was a set of compact models that outperformed rivals in their weight class, besting models including Google's Gemma 2 2.6B and Microsoft's Phi-2 2.7B on various benchmarks.

Meta is also working hard to boost on-device AI. It has forged alliances with hardware titans Qualcomm, MediaTek, and Arm to ensure Llama 3.2 plays nice with mobile chips from day one. Cloud computing giants aren't left out either: AWS, Google Cloud, Microsoft Azure, and a host of others are offering instant access to the new models on their platforms.

Under the hood, Llama 3.2's vision capabilities come from clever architectural tweaking. Meta's engineers baked adapter weights into the existing language model, creating a bridge between pre-trained image encoders and the text-processing core.

In other words, the model’s vision capabilities don’t come at the expense of its text processing competence, so users can expect similar or better text results when compared to Llama 3.1.

The Llama 3.2 release is open source, at least by Meta’s standards. Meta is making the models available for download on Llama.com and Hugging Face, as well as through its extensive partner ecosystem.

Those interested in running it in the cloud can use their own Google Colab notebook or use Groq for text-based interactions, generating nearly 5,000 tokens in less than 3 seconds.

Riding the Llama

We put Llama 3.2 through its paces, quickly testing its capabilities across various tasks.

In text-based interactions, the model performs on par with its predecessors. However, its coding abilities yielded mixed results.

When tested on Groq's platform, Llama 3.2 successfully generated code for popular games and simple programs. Yet the smaller 70B model stumbled when asked to create functional code for a custom game we devised. The more powerful 90B, however, was a lot more efficient and generated a functional game on the first try.
You can see the full code generated by Llama-3.2 and all the other models we tested by clicking on this link.

Identifying styles and subjective elements in images

Llama 3.2 excels at identifying subjective elements in images. When presented with a futuristic, cyberpunk-style image and asked if it fit the steampunk aesthetic, the model accurately identified the style and its elements. It provided a satisfactory explanation, noting that the image didn't align with steampunk due to the absence of key elements associated with that genre.

Chart Analysis (and SD image recognition)

Chart analysis is another strong suit for Llama 3.2, though it does require high-resolution images for optimal performance. When we input a screenshot containing a chart, one that other models like Molmo or Reka could interpret, Llama's vision capabilities faltered. The model apologized, explaining that it couldn't read the letters properly due to the image quality.

Text in Image Identification

While Llama 3.2 struggled with small text in our chart, it performed flawlessly when reading text in larger images. We showed it a presentation slide introducing a person, and the model successfully understood the context, distinguishing between the name and job role without any errors.

Verdict

Overall, Llama 3.2 is a big improvement over the previous generation and a great addition to the open-source AI industry. Its strengths are in image interpretation and large-text recognition, with some areas for potential improvement, particularly in processing lower-quality images and tackling complex, custom coding tasks.

The promise of on-device compatibility is also good for the future of private and local AI tasks, and it is a great counterweight to closed offerings like Gemini Nano and Apple’s proprietary models.

Edited by Josh Quittner and Sebastian Sinclair

