2024-05-22 | Simon Liu

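Introduction

Phi-3 Vision-128K-Instruct is a multimodal model developed by Microsoft, released as open source on Hugging Face today (May 22). It processes both text and images, supports a 128K-token context length, and targets tasks that call for high-quality, reasoning-intensive processing. It is used for image understanding, optical character recognition (OCR), and chart and table parsing.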

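Key features

Multimodal processing: accepts both text and image input and runs efficiently in resource-constrained environments.
Low-latency scenarios: suited to applications that need fast responses.
Image understanding: strong image parsing, including charts and tables.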
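To make these use cases concrete, here is a minimal sketch of how a user message might be built for each task. The `<|image_1|>` placeholder is the image tag used in the usage example in this post; the task names, the prompt wording, and the `build_user_message` helper are hypothetical and only meant as illustration.

```python
# Hypothetical prompt templates; the wording is illustrative only.
TASK_PROMPTS = {
    "describe": "What is shown in this image?",
    "ocr": "Transcribe all text visible in this image.",
    "chart": "Summarize the main trend shown in this chart.",
    "table": "Convert the table in this image to Markdown.",
}


def build_user_message(task: str, image_index: int = 1) -> dict:
    """Return a chat message that references one image via its placeholder tag."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    if image_index < 1:
        raise ValueError("image placeholders are numbered from 1")
    placeholder = f"<|image_{image_index}|>"
    return {"role": "user", "content": f"{placeholder}\n{TASK_PROMPTS[task]}"}


if __name__ == "__main__":
    print(build_user_message("ocr")["content"])
```

A message built this way can be placed in the `messages` list that is passed to the chat template before generation.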

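Model architecture

Phi-3 Vision-128K-Instruct has 4.2 billion parameters and consists of an image encoder, a connector, a projector, and the Phi-3 Mini language model. It was trained on up to 500 billion tokens of mixed image and text data, including carefully selected public content, high-quality educational material and code, high-quality interleaved image-text data, new "textbook-like" synthetic data and chart images, and high-quality supervised chat-format data covering topics such as instruction following, truthfulness, honesty, and helpfulness. Data containing personal information was filtered out during collection to protect privacy.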
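For a rough sense of what "lightweight" means in practice, the sketch below estimates the memory needed just to hold the weights, assuming a parameter count of 4.2 billion (the figure published for this model). It covers weights only; activations, the KV cache, and framework overhead are not included, so real usage will be higher. The helper name and the list of dtypes are assumptions made for illustration.

```python
PARAMS = 4.2e9  # assumed parameter count (4.2 billion)

# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Estimate the memory for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(PARAMS, dtype):.1f} GiB")
```

In a 16-bit format the weights come to a little under 8 GiB, which is why a single consumer-class GPU can be enough for inference.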

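Responsible AI considerations

Use of the model should comply with applicable laws and regulations, and a safety evaluation should be carried out before deploying it in high-risk scenarios. Implementing transparency best practices and setting up feedback mechanisms is recommended.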

Usage

Load and run the model with the AutoModelForCausalLM and AutoProcessor classes from the transformers library. Example code:

from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
# trust_remote_code=True lets transformers run the modeling code shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# The conversation must end with a user turn so that the generation prompt
# asks the model for a new reply. <|image_1|> marks where the first image goes.
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
]

url = "https://example.com/image.png"  # placeholder; replace with a real image URL
image = Image.open(requests.get(url, stream=True).raw)
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

# Greedy decoding: with do_sample=False the output is deterministic.
generation_args = {"max_new_tokens": 500, "do_sample": False}
generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)
# Drop the prompt tokens so that only the newly generated answer is decoded.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(response)
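It can help to see what the chat template turns the message list into. The sketch below is a hypothetical helper, `render_phi3_prompt`, that approximates the chat format described on the model card (`<|user|>`, `<|assistant|>`, and `<|end|>` tags). It is an illustration under that assumption, not the tokenizer's actual template; `apply_chat_template` remains the authoritative source. The follow-up question in the multi-turn example is invented for illustration.

```python
def render_phi3_prompt(messages, add_generation_prompt=True):
    """Approximate the Phi-3 chat format as a plain string (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped as <|role|>\n{content}<|end|>\n
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # Open an assistant turn only when the model is expected to reply next.
    if add_generation_prompt and messages and messages[-1]["role"] != "assistant":
        parts.append("<|assistant|>\n")
    return "".join(parts)


if __name__ == "__main__":
    # Multi-turn conversation: an earlier answer is kept as context,
    # and the final user turn is the one the model should respond to.
    conversation = [
        {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
        {"role": "assistant", "content": "A bar chart of survey responses."},
        {"role": "user", "content": "Which bar is the tallest?"},  # hypothetical follow-up
    ]
    print(render_phi3_prompt(conversation))
```

Note that a conversation ending with an assistant message gets no new assistant tag in this sketch, which is why the final turn should come from the user when a fresh reply is wanted.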

Resources and technical documentation

For more details and technical documentation, see the following resources:

Phi-3 Microsoft Blog
Phi-3 Technical Report
Phi-3 on Azure AI Studio
Phi-3 Cookbook

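Conclusion

Phi-3 Vision-128K-Instruct is a lightweight multimodal model with a context length of up to 128K tokens. It was trained on a combined text and image dataset with a focus on high-quality, reasoning-dense data. It suits a wide range of commercial and research uses, especially in compute-constrained and latency-sensitive environments. The model offers strong image understanding and optical character recognition (OCR) capabilities and comes with safety and responsible AI considerations. It is easy to use and performs well on a range of zero-shot benchmarks.

For more details, see Phi-3 Vision-128K-Instruct.

Source: https://medium.com/@simon3458/phi-3-vision-brief-introduction-de97639d4eb8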

[Quick Look] Phi-3 Vision: Microsoft's small open-source multimodal model for text and images
