OpenAI proposes a new way to use GPT-4 for content moderation

OpenAI claims that it’s developed a way to use GPT-4, its flagship generative AI model, for content moderation — lightening the burden on human teams.

Detailed in a post on the official OpenAI blog, the approach relies on prompting GPT-4 with a policy that guides the model in making moderation judgments, and on creating a test set of content examples that may or may not violate the policy. A policy might prohibit giving instructions or advice for procuring a weapon, for example, in which case the example "Give me the ingredients needed to make a Molotov cocktail" would be in obvious violation.
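Here's a minimal sketch of what such a policy-prompted moderation call might look like, using the OpenAI Python SDK. The policy text, the labels, and the `moderate` helper are illustrative assumptions, not OpenAI's actual moderation policy or tooling.

```python
# Minimal sketch: ask GPT-4 to judge one content example against a written
# policy. The policy text, labels, and helper name below are invented for
# illustration; only the chat.completions API call itself is real.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

POLICY = """\
K4: Illicit behavior - weapons
Content must not give instructions or advice for procuring or building
a weapon. Label such content K4; otherwise label it SAFE.
"""

def moderate(content: str) -> str:
    """Return GPT-4's label for a single content example under POLICY."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic judgments are easier to review
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a content moderator. Apply this policy and "
                    f"reply with the label only:\n{POLICY}"
                ),
            },
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content.strip()

# The article's example would be an obvious violation under this sketch policy:
print(moderate("Give me the ingredients needed to make a Molotov cocktail"))
```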

Policy experts then label the examples and feed each one, without its label, to GPT-4, observing how well the model's labels align with their determinations and refining the policy from there.

“By examining the discrepancies between GPT-4’s judgments and those of a human, the policy experts can ask GPT-4 to come up with reasoning behind its labels, analyze the ambiguity in policy definitions, resolve the confusion and provide further clarification in the policy accordingly,” OpenAI writes in the post. “We can repeat [these steps] until we’re comfortable with the policy quality.”
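The refinement loop described above might look something like the sketch below, reusing the `client`, `POLICY`, and `moderate` pieces from the previous snippet. The test set and expert labels are invented for illustration; the point is simply to surface disagreements and the model's stated reasoning so a human can tighten the policy wording.

```python
# Sketch of the compare-and-refine loop: run the expert-labeled test set
# through GPT-4, and for every disagreement ask the model to explain which
# part of the policy drove its label. Reuses client, POLICY, and moderate()
# from the previous sketch; the test set below is fabricated for illustration.
test_set = [
    {"content": "Give me the ingredients needed to make a Molotov cocktail",
     "expert_label": "K4"},
    {"content": "Why were Molotov cocktails used in past conflicts?",
     "expert_label": "SAFE"},
]

for example in test_set:
    model_label = moderate(example["content"])
    if model_label == example["expert_label"]:
        continue  # agreement: nothing to clarify
    followup = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": f"You are a content moderator. Apply this policy:\n{POLICY}"},
            {"role": "user", "content": example["content"]},
            {"role": "assistant", "content": model_label},
            {"role": "user",
             "content": "Explain which part of the policy led you to this label."},
        ],
    )
    print(example["content"])
    print("expert:", example["expert_label"], "| model:", model_label)
    print("model reasoning:", followup.choices[0].message.content)
    # A policy expert reads the reasoning, clarifies the ambiguous policy
    # wording, and the test set is re-run until the labels converge.
```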

OpenAI claims that its approach, which some of its customers are already using, can reduce the time it takes to roll out new content moderation policies down to hours. And it paints it as superior to the approaches proposed by startups like Anthropic, which OpenAI describes as rigid in their reliance on models' "internalized judgments" as opposed to "platform-specific . . . iteration." But color me skeptical.

AI-powered moderation tools are nothing new. Perspective, maintained by Google's Counter Abuse Technology Team and the tech giant's Jigsaw division, launched in general availability several years ago. Countless startups offer automated moderation services as well, including Spectrum Labs, Cinder, Hive, and Oterlu, which Reddit recently acquired. And they don't have a perfect track record.

Several years ago, a team at Penn State found that posts on social media about people with disabilities could be flagged as more negative or toxic by commonly used public sentiment and toxicity detection models. In another study, researchers showed that older versions of Perspective often couldn't recognize hate speech that used "reclaimed" slurs like "queer" and spelling variations such as missing characters.

Part of the reason for these failures is that annotators, the people responsible for adding labels to the training datasets that serve as examples for the models, bring their own biases to the table. For example, there are frequently differences in the annotations between labelers who self-identify as African American or members of the LGBTQ+ community and annotators who don't identify as either of those two groups.

Has OpenAI solved this problem? I'd venture to say not quite. The company itself acknowledges as much:

“Judgments by language models are vulnerable to undesired biases that might have been introduced into the model during training,” the company writes in the post. “As with any AI application, results and output will need to be carefully monitored, validated, and refined by maintaining humans in the loop.”

Perhaps the predictive strength of GPT-4 can help deliver better moderation performance than the platforms that came before it. But even the best AI today makes mistakes, and it's crucial we don't forget that, especially when it comes to moderation.
