An Introduction to OpenAI Moderation

Besides chat models, the OpenAI API also offers a number of other useful capabilities, including Moderation.

Here is what I know.

Is This API Free?

Yes, that's right: the Moderation endpoint is free to use.

Purpose of Moderation

Given a piece of content, the Moderation endpoint scores it against a set of predefined categories and flags whether the content falls into each category.

OpenAI supports the following categories:

https://static.1991421.cn/2024/2024-06-26-223340.jpeg
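
To make this concrete, here is a minimal sketch of calling the Moderation endpoint with the official openai Python SDK (v1 style); the input text is just an illustrative placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.moderations.create(input="Some text to check")
result = response.results[0]

print(result.flagged)          # True if the text triggers any category
print(result.categories)       # boolean flag per category
print(result.category_scores)  # score per category, between 0 and 1
```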

Examples

  1. Who are you?

    https://static.1991421.cn/2023/2023-07-03-231010.jpeg

  2. How to suicide?

    https://static.1991421.cn/2023/2023-07-03-231145.jpeg

As you can see, the flags correctly identify content that violates the moderation rules: the first question is not flagged, while for the second the self-harm/intent category is true.
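
The same two checks can be reproduced in code. This sketch reuses the client from the earlier snippet and assumes the SDK's result objects are pydantic models exposing model_dump():

```python
for text in ["Who are you?", "How to suicide?"]:
    result = client.moderations.create(input=text).results[0]
    # collect the category names whose boolean flag is true
    hits = [name for name, hit in result.categories.model_dump().items() if hit]
    print(f"{text!r} -> flagged={result.flagged}, categories={hits}")
```

For the first question nothing should be flagged; for the second, flagged is true and the self-harm related categories show up in the list.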

About Category Scores

Sometimes we want to control how strongly content must violate a category before we act on it, rather than relying on the simple on/off flag for each category.

In that case, the category scores can be used to define our own moderation thresholds, although the categories themselves are still limited to those OpenAI provides; a sketch follows below.
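
For example, here is a minimal sketch of custom thresholds built on top of category_scores. The threshold values and the choice of categories are made up for illustration, and the snake_case attribute names assume the Python SDK's field naming:

```python
# Hypothetical thresholds, stricter than the default boolean flags.
SELF_HARM_INTENT_LIMIT = 0.2
VIOLENCE_LIMIT = 0.5

def violates_policy(text: str) -> bool:
    scores = client.moderations.create(input=text).results[0].category_scores
    return (
        scores.self_harm_intent >= SELF_HARM_INTENT_LIMIT
        or scores.violence >= VIOLENCE_LIMIT
    )
```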

Limitations of Moderation

  1. Limited categories: only the categories listed above are supported, so, for example, political content is not covered.
  2. Limited non-English support: in my testing it does work with Chinese, but the coverage is not comprehensive.

Final Thoughts

For content safety in AI chat applications, there are currently a few approaches:

  1. System prompts/user prompts: prompts and conversation history can, to some extent, constrain the scope of the AI's responses.
  2. Content moderation: the Moderation endpoint discussed here can be used to screen both the user's questions and the model's answers, as sketched below.
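
As a rough sketch of the second approach, both the question and the answer can be passed through the Moderation endpoint around a chat completion call; the model name and refusal message here are assumptions:

```python
REFUSAL = "Sorry, I can't help with that."

def safe_chat(user_message: str) -> str:
    # Screen the question before it reaches the chat model.
    if client.moderations.create(input=user_message).results[0].flagged:
        return REFUSAL

    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; substitute your own
        messages=[{"role": "user", "content": user_message}],
    ).choices[0].message.content

    # Screen the answer before returning it to the user.
    if client.moderations.create(input=reply).results[0].flagged:
        return REFUSAL
    return reply
```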

Documentation