diff --git a/docs/ability/florence-2-large.md b/docs/ability/florence-2-large.md new file mode 100644 index 0000000000000000000000000000000000000000..fee48eb612298c843d99c6f0f1c2a2efbd282c62 --- /dev/null +++ b/docs/ability/florence-2-large.md @@ -0,0 +1,303 @@ +# 一、模型简介 + +Florence-2-large是微软出品的开源多功能图像标记模型,可以辅助标记图像内容,包括生成图像描述、目标识别等。得益于大模型架构,Florence-2还支持使用提示词定向标记图中特定对象。Gitee AI的Serverless API服务重新设计该模型的使用方法,将繁杂的功能整合为更易使用的接口,下面将分别进行介绍。 + +# 二、功能与使用 + +为了能顺利使用咱们的接口,拿到授权Token是很重要的,您可以在[此处](https://ai.gitee.com/dashboard/settings/tokens)获取免费额度的Token或[购买资源包](https://ai.gitee.com/serverless-api/order?package=1910)(需要登录)后绑定新建的Token。值得注意的是,该模型仅**支持和识别英文**,因此无论是想输入的提示词还是识别的内容和输出的结果**都将是英文的**,任何其它语言的信息都会导致模型出现无法预计的输出结果。 + +接口的使用以Python代码为例,下面将分别演示“图像描述”与“目标识别”两个功能项的使用方法。在开始前先定义请求函数如下: + +```python +import requests + +headers = { + "Authorization": "Bearer ", + "X-Package": "342" +} + +#用于请求图像描述的url +url_caption = "https://ai.gitee.com/v1/images/caption" +#用于请求目标识别的url +url_object_detection = "https://ai.gitee.com/v1/images/object-detection" + +def query(url, payload): + files = { + "image": (payload["image"], open(payload["image"], "rb")) + } + data = {key: payload[key] for key in payload if key not in files} + response = requests.post(url, headers=headers, files=files, data=data) + return response.json() +``` + +## 1. 图像描述生成 + +该功能通过请求中输入一种图片,从而获得该图片的文本描述信息,请求代码如下: + +```python +output = query(url_caption, { + "model": "Florence-2-large", + "image": "path/to/image.jpg", + "caption_level": 0 +}) +``` + +请求参数说明: + +- `image`:需要进行描述的图片,仅支持输入一张图片。 +- `caption_level`:需要描述图片的详细程度,**支持0、1、2三个等级**,等级越高则描述得越详细,输出的字数越多。等级可根据需求进行调整,若不填写则采用默认等级0。 + +本次示例采用的图片如下: + +![](../../static/img/base/florence-2-image.webp) + +设置 `caption_level=0`时输出如下: + +> A woman and a little girl walking down a dirt road. + +设置 `caption_level=1`时输出如下: + +> The image shows a woman and a little girl walking down a dirt road, hand in hand, with a horse in the background. The sky is filled with clouds and the ground is covered with lush green grass. The image is animated, giving it a whimsical feel. + +设置 `caption_level=2`时输出如下: + +> The image is an illustration of a mother and daughter walking hand in hand in a field. The mother is wearing a long white dress with pink flowers on it and has long blonde hair. She is holding the hand of the daughter, who is also wearing a purple dress. They are both smiling and appear to be enjoying each other's company. +> In the background, there is a fence with wooden posts and a horse grazing on the grass. The sky is filled with fluffy white clouds and the sun is shining brightly, creating a warm glow. The field is covered in yellow flowers and there are hills in the distance. The overall mood of the image is peaceful and serene. + +从结果中可以直观的感受到不同级别下描述详细程度的不同,您可根据需求选择不同的等级。 + +## 2. 目标识别 + +该功能通过在请求中输入一张图片,从而获得该图片主体目标的标签和位置信息。与图像描述不同的是,该功能不提供标签的置信度,每个结果都是确定的,且支持输入提示词参数 `prompt`进行辅助。以是否输入 `prompt`作为区分,该功能可分为"传统目标识别"和“指令目标识别” + +### 2.1 传统目标识别 + +该识别方法类似传统的目标识别任务,给出标签与绘制矩形的参数。基于该方法的请求代码如下: + +代码中仅使用 `image`和 `caption_level`两个参数: + +- `image`:需要进行描述的图片,仅支持输入一张图片。 +- `caption_level`:需要描述图片的详细程度,**支持0、1两个个等级**,等级越高则描述得越详细,输出的字数越多。等级可根据需求进行调整,若不填写则采用默认等级0。 + +使用如下的请求代码: + +```python +output = query({ + "model": "Florence-2-large", + "image": "path/to/image.jpg", + "caption_level": 0 +}) +``` + +请求结果格式如下: + +```json +{ + "num_objects": int, + "objects":[ + { + "label": str, + "bbox": [x1, y1, x2, y2] + }, ... + ] +} +``` + +是一个json格式,详细说明如下: + +- `num_objects`:识别到图中目标的数量。 +- `objects`:一个数组对象,数组中的每个对象包含了每个目标的标签和位置。 + - `label`:该目标的标签信息。 + - `bbox`:该目标的位置,是一个四元组,可两两分组为 `(x1,y1), (x2,y2)`,分别表示矩形框左上角和右下角的坐标。 + 我们使用相同的示例图片。 + +设置 `caption_level=0`时输出如下: + +```json +{ + "num_objects":5, + "objects":[ + { + "label":"animal", + "bbox":[58.880001068115234,598.3999633789062,201.21600341796875,748.1599731445312] + }, + { + "label":"girl", + "bbox":[321.0240173339844,914.5599975585938,478.72003173828125,1203.8399658203125] + }, + { + "label":"human face", + "bbox":[501.2480163574219,753.2799682617188,545.280029296875,795.5199584960938] + }, + { + "label":"human face", + "bbox":[379.39202880859375,929.9199829101562,414.2080078125,977.2799682617188] + }, + { + "label":"woman", + "bbox":[427.52001953125,700.7999877929688,804.35205078125,1238.4000244140625] + } + ] +} +``` + +设置 `caption_level=1`时输出如下: + +```json +{ + "num_objects":4, + "objects":[ + { + "label":"girl in white dress with pink flowers in field at sunset", + "bbox":[427.52001953125,700.7999877929688,805.3760375976562,1238.4000244140625] + }, + { + "label":"girl with red hair and blue dress in field with wooden fence", + "bbox":[311.8080139160156,914.5599975585938,479.7440185546875,1203.8399658203125] + }, + { + "label":"brown horse with blonde mane and tail in field", + "bbox":[58.880001068115234,598.3999633789062,201.21600341796875,748.1599731445312] + }, + { + "label":"human face", + "bbox":[501.2480163574219,753.2799682617188,545.280029296875,795.5199584960938] + } + ] +} +``` + +值得注意的是,设置不同的 `caption_level`可能会出现**完全不同**的识别结果。根据经验表明:``等级0识别出的结果要多于等级1的``。这是因为描述的详细程度需要依赖模型的识别能力,图中目标的信息越丰富(例如有丰富的环境和动作信息)则越能在等级1中被识别到。换句话说,等级1更偏向于选择复杂的目标,目标的质量更高。因此,您可以根据这样的经验选择合适的参数进行目标识别。 + +### 2.2 指令目标识别 + +该功能在请求输入一张图片的同时,还要输入一句 `prompt`提示词。模型将提取 `prompt`中与图片内容相关的词语作为标签结果进行目标识别。注意,若使用了 `prompt`参数,`caption_level`参数将会失效,标签内容的详细程度将由提示词决定。说了这么多还是有些抽象,让我们上代码: + +```python +output = query(url_onject_detection, { + "model": "Florence-2-large", + "image": "path/to/image.jpg", + "prompt": "beautifule girl in the image" +}) +``` + +请求参数说明: + +- `image`:需要进行描述的图片,仅支持输入一张图片。 +- `prompt`:辅助提示词。 + +请求结果如下: + +```json +{ + "num_objects":2, + "objects":[ + { + "label":"beautifule girl", + "bbox":[433.6640319824219,702.0799560546875,806.4000244140625,1239.679931640625] + }, + { + "label":"beautifule girl", + "bbox":[317.9520263671875,913.2799682617188,479.7440185546875,1208.9599609375] + } + ] +} +``` + +从请求结果中可以看到,模型发现 `prompt`中的'beautiful girl'的语义与图中的两个女孩子相近,因此模型对两个目标做目标识别,给出了 `bbox`位置,同时标签使用我们在 `prompt`中的词语或短句。 + +该功能适合用于当您想对图中指定的主体对象进行标记时,模型会使用咱们 `prompt`内的词语作为 `label`给到具体的标记对象。 + +:::tip +偷偷告诉您,我们发现热门影视动漫角色模型也能识别到哦(我们试过《星球大战》以及《海贼王》),可以尝试玩一下~ +::: + +# 三、标签结果后处理 + +在目标识别获得标签后,为了更好的看到标记效果,我们这边提供一个绘制结果的函数供参考,代码如下: + +```python +from PIL import Image, ImageDraw, ImageFont + +def draw_labelled_bbox(image, bbox, label): + draw = ImageDraw.Draw(image) + + # 设置字体 + font_size = 16 # 初始字体大小 + try: + font = ImageFont.truetype("arial.ttf", font_size) # 使用系统字体 + except IOError: + font = ImageFont.load_default() # 如果没有字体文件,使用默认字体 + + draw.rectangle(bbox, outline="red", width=3) + x1, y1, x2, y2 = bbox + max_width = x2 - x1 + + words = label.split() + text_lines = [] + current_line = "" + + for word in words: + test_line = f"{current_line} {word}".strip() + line_width = draw.textlength(test_line, font=font) + if line_width <= max_width: + current_line = test_line + else: + text_lines.append(current_line) + current_line = word + if current_line: + text_lines.append(current_line) + + line_height = font.getbbox("A")[3] - font.getbbox("A")[1] + label_height = line_height * len(text_lines) + 4 + label_box = [x1, y1 - label_height, x2, y1] + + draw.rectangle(label_box, fill="red") + text_y = y1 - label_height + 2 + for line in text_lines: + draw.text((x1 + 2, text_y), line, fill="white", font=font) + text_y += line_height + + return image + + +# 示例参数 +image = Image.open("/path/to/your/image.jpg") +bbox = [148, 276, 290, 653] +label = "put your label here" + +# 调用函数 +image = draw_labelled_bbox(image, bbox, label) + +# 展示与保存 +image.show() +image.save("/path/to/save.jpg") +``` + +以示例图片和具体的标签结果 `obj`: + +```json +{ + "label":"girl in white dress with pink flowers in field at sunset", + "bbox":[433.6640319824219,702.0799560546875,806.4000244140625,1239.679931640625] +} +``` + +为例(假设结果已转为python中的 `dict`类),我们使用如下代码调用函数: + +```python +image = loaded_image +label = obj['label'] +bbox = obj['bbox'] + +image = draw_labelled_bbox(image, bbox, label) +image.show() +``` + +可得到标注结果如下图所示: + +![](../../static/img/base/florence-tag-res.jpg) + +示例代码中有字体和调用相关的注释,您可以根据需求自行更改。 + +以上就是全部的教程了,祝您调用愉快! diff --git a/static/img/base/florence-2-image.webp b/static/img/base/florence-2-image.webp new file mode 100644 index 0000000000000000000000000000000000000000..da605d421cd0c2f89c30e97d24812bdcd9e5e840 Binary files /dev/null and b/static/img/base/florence-2-image.webp differ diff --git a/static/img/base/florence-tag-res.jpg b/static/img/base/florence-tag-res.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d9b6acf55b49cf93eeafeab5daee30cb87b71b55 Binary files /dev/null and b/static/img/base/florence-tag-res.jpg differ