Meta Unveils Open Source Llama 3.2: AI That Sees And Fits in Your Pocket

AICoin官方 · 2024-09-27T00:37:04.000Z

It's been a good week for open-source AI. On Wednesday, Meta announced an upgrade to its state-of-the-art large language model, Llama 3.2, and it doesn't just talk: it sees.

More intriguing, some versions can squeeze into your smartphone without losing quality, which means you could potentially have private local AI interactions, apps and customizations without sending your data to third-party servers.

Unveiled Wednesday during Meta Connect, Llama 3.2 comes in four flavors, each packing a different punch. The heavyweight contenders, the 11B and 90B parameter models, flex their muscles with both text and image processing capabilities. They can tackle complex tasks such as analyzing charts, captioning images, and even pinpointing objects in pictures based on natural language descriptions.

Llama 3.2 arrived the same week as Allen Institute’s Molmo, which claimed to be the best open-source multimodal vision LLM in synthetic benchmarks, performing in our tests on par with GPT-4o, Claude 3.5 Sonnet, and Reka Core.

Zuck’s company also introduced two new flyweight champions: a pair of 1B and 3B parameter models designed for efficiency, speed, and limited but repetitive tasks that don’t require too much computation.

These small models are multilingual text maestros with a knack for “tool-calling,” meaning they can integrate better with programming tools. Despite their diminutive size, they boast an impressive 128K token context window (the same as GPT-4o and other powerful models), making them ideal for on-device summarization, instruction following, and rewriting tasks.

Meta's engineering team pulled off some serious digital gymnastics to make this happen. First, they used structured pruning to trim redundant parameters from larger models, then employed knowledge distillation (transferring knowledge from large models to smaller ones) to squeeze in extra smarts.

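The article does not describe Meta's exact training recipe, so the following is only a minimal sketch of the textbook form of knowledge distillation: the small "student" model is penalized by the KL divergence between its temperature-softened output distribution and that of the large "teacher." The function names, the toy logits, and the temperature value are all hypothetical and chosen for illustration; structured pruning is not shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, softened by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    Multiplying by temperature**2 is a common convention that keeps the
    loss on a comparable scale across temperatures; it is an assumption
    here, not something the article states.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical next-token logits over a 4-word vocabulary.
teacher = [4.0, 1.0, 0.5, -2.0]
close_student = [3.5, 1.2, 0.3, -1.5]  # roughly agrees with the teacher
far_student = [-2.0, 0.5, 1.0, 4.0]    # ranks the tokens in reverse

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(f"close student loss: {loss_close:.4f}")
print(f"far student loss:   {loss_far:.4f}")
```

A student whose predictions track the teacher's gets a small loss, while one that disagrees gets a large loss; minimizing this quantity during training is what pushes the smaller model toward the larger model's behavior.
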
The result was a set of compact models that outperformed rivals in their weight class, besting models including Google's Gemma 2 2.6B and Microsoft's Phi-2 2.7B on various benchmarks.

Meta is also working hard to boost on-device AI. They've forged alliances with hardware titans Qualcomm, MediaTek, and Arm to ensure Llama 3.2 plays nice with mobile chips from day one. Cloud computing giants aren't left out either: AWS, Google Cloud, Microsoft Azure, and a host of others are offering instant access to the new models on their platforms.

Under the hood, Llama 3.2's vision capabilities come from clever architectural tweaking. Meta's engineers baked adapter weights into the existing language model, creating a bridge between pre-trained image encoders and the text-processing core.

In other words, the model’s vision capabilities don’t come at the expense of its text processing competence, so users can expect similar or better text results when compared to Llama 3.1.

The Llama 3.2 release is open source, at least by Meta’s standards. Meta is making the models available for download on Llama.com and Hugging Face, as well as through their extensive partner ecosystem.

Those interested in running it on the cloud can use their own Google Colab notebook or use Groq for text-based interactions, generating nearly 5000 tokens in less than 3 seconds.

Riding the Llama
We put Llama 3.2 through its paces, quickly testing its capabilities across various tasks.

In text-based interactions, the model performs on par with its predecessors. However, its coding abilities yielded mixed results.

When tested on Groq's platform, Llama 3.2 successfully generated code for popular games and simple programs. Yet, the smaller 70B model stumbled when asked to create functional code for a custom game we devised. The more powerful 90B, however, was a lot more efficient and generated a functional game on the first try.

You can see the full code generated by Llama-3.2 and all the other models we tested by clicking on this link.

Identifying styles and subjective elements in images
Llama 3.2 excels at identifying subjective elements in images. When presented with a futuristic, cyberpunk-style image and asked if it fit the steampunk aesthetic, the model accurately identified the style and its elements. It provided a satisfactory explanation, noting that the image didn't align with steampunk due to the absence of key elements associated with that genre.

Chart Analysis (and SD image recognition)
Chart analysis is another strong suit for Llama 3.2, though it does require high-resolution images for optimal performance. When we input a screenshot containing a chart, one that other models like Molmo or Reka could interpret, Llama's vision capabilities faltered. The model apologized, explaining that it couldn't read the letters properly due to the image quality.

Text in Image Identification
While Llama 3.2 struggled with small text in our chart, it performed flawlessly when reading text in larger images. We showed it a presentation slide introducing a person, and the model successfully understood the context, distinguishing between the name and job role without any errors.

Verdict
Overall, Llama 3.2 is a big improvement over its previous generation and is a great addition to the open-source AI industry. Its strengths are in image interpretation and large-text recognition, with some areas for potential improvement, particularly in processing lower-quality images and tackling complex, custom coding tasks.

The promise of on-device compatibility is also good for the future of private and local AI tasks and is a great counterweight to closed offerings like Gemini Nano and Apple’s proprietary models.

Edited by Josh Quittner and Sebastian Sinclair

