Inaccurate, duplicate, and incomplete data continue to plague industries. Artificial intelligence is used to mitigate these issues, but it has inherent limitations: AI datasets can themselves contain mislabeled or irrelevant data.
Fraction AI is pioneering a new approach to data labeling by combining the efficiency of AI agents with human insights. The company recently completed a $6 million pre-seed funding round co-led by Symbolic and Spartan alongside strategic investments from Illia Polosukhin (Near), Sandeep Nailwal (Polygon), and other outstanding angel investors.
Fraction AI tackles the growing challenge of producing high-quality data. Traditional methods depend solely on either AI or humans. Fraction AI aims to use human understanding as guidance for AI agents. Funds from the round will go toward further research and infrastructure upgrades to scale the hybrid approach, whose effectiveness is supported by research on gamified adversarial prompting.
Introducing Gamified Adversarial Prompting
Data scientists have demonstrated that the datasets created using GAP, or gamified adversarial prompting, enhance the performance of the latest AI models. The GAP framework involves crowdsourcing high-quality data to fine-tune large multimodal models, turning data collection into an engaging game. It encourages players to provide complex, fine-grained questions and answers that fill gaps in the models’ knowledge.
In lay terms, Fraction AI incentivizes AI agents to create high-quality data through real-time competitions. Developers set up and launch agents with detailed instructions that guide their actions toward the best possible outcomes, while staked ether serves as the economic foundation. Participants earn economic incentives, which sustains a continuous stream of valuable training data.
Current issues with data quality
Inaccurate data costs organizations tens of millions of dollars a year. Common examples include misspelled customer names, addresses with errors, and incorrect data entries in general. Whatever the cause, inaccurate data cannot be trusted because it skews any analysis built on it.
When data is imported from multiple sources, duplicate records are common. In retail, for example, you might import customer lists from two sources and find a few people who bought from both retailers. Duplicate records become a problem because you only want to count each customer once.
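The duplicate-customer case can be sketched in a few lines of Python. This is a minimal illustration, not a production deduplication method: it assumes each record carries an email address and that a normalized email is a good enough identity key, which will not hold for every dataset. The function name and sample records are hypothetical.

```python
def merge_customers(list_a, list_b):
    """Merge two customer lists, keeping one record per normalized email.

    Assumption: a trimmed, lowercased email identifies a customer.
    """
    merged = {}
    for record in list_a + list_b:
        key = record["email"].strip().lower()
        # Keep the first record seen for each key; later duplicates are skipped.
        merged.setdefault(key, record)
    return list(merged.values())


# Hypothetical sample data from two retailers.
store_a = [
    {"name": "Ana Diaz", "email": "ana@example.com"},
    {"name": "Bo Li", "email": "bo@example.com"},
]
store_b = [
    {"name": "Ana Diaz", "email": " Ana@Example.com "},
    {"name": "Cy Roe", "email": "cy@example.com"},
]

unique_customers = merge_customers(store_a, store_b)
```

Here the customer who shopped at both retailers is counted once, even though the two systems stored the email with different casing and whitespace.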
When data is combined from two different systems, inconsistent formatting can arise. Cross-system inconsistencies can cause major data quality issues unless they are identified and rectified swiftly.
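As an illustration of a cross-system inconsistency, the sketch below normalizes dates that arrive in different formats into ISO 8601. The list of formats is an assumption made for the example, and it treats slash-separated dates as month-first, which would need confirming against the actual source systems. Values that match no known format are returned as None so they can be reviewed rather than silently guessed.

```python
from datetime import datetime

# Assumed input formats; a real pipeline would list the formats
# its source systems actually emit.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


normalized = [normalize_date(value) for value in ("2024-03-05", "03/05/2024", "05.03.2024")]
```

All three inputs resolve to the same calendar day once the format differences are removed.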
Incomplete data and dark data are two additional problems. Some records are missing key information, such as phone numbers without area codes or demographic details without the age entered. Dark or hidden data is data that’s collected and stored but not actively used. IBM estimates that 90% of all sensor data collected from IoT devices remains unused. Many organizations aren’t even aware of this wasted resource, which accounts for more than 50% of the average organization’s data storage expenses.
Human understanding facilitates improvement
As an educational tool, GAP motivates humans to challenge the limitations of AI models, leading to notable improvements in performance. It encourages error detection by tasking players with identifying inaccuracies or inconsistencies in datasets or AI outputs. Players' diverse backgrounds bring varied perspectives, making it easier to spot biases that a single development team might overlook.
Gamification encourages innovative thinking through challenges or puzzles designed to stretch the limits of a dataset or model. Players can uncover novel use cases, detect biased outputs or inputs, and propose more inclusive alternatives. This reduces systemic biases in data and models, creating a more equitable foundation for all kinds of applications. Additionally, participants will flag previously unnoticed data anomalies because they’ll be rewarded for uncovering flaws. Rewards for identifying significant flaws could conceivably be higher, reducing the risk of unexpected failures or vulnerabilities in real-world applications.
As the technology scales, more and more people can play simultaneously, and the sheer volume of input accelerates the identification of weaknesses.
The dark side of creativity
Creative problem-solving doesn’t have to be for the public good. For some users, the rewards would be the primary motivation, leading to an excessive focus on them. Taking this a step further, it’s not unreasonable to expect malicious actors to try to game the system, and platforms will need to deploy mechanisms to detect and block harmful activities. One example is using AI and statistical models to monitor user behavior and flag anomalies that suggest spamming. Unusually high submission rates or repetitive patterns from a single user could be flagged for review.
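The two signals mentioned above, unusually high submission rates and repetitive submissions, can be sketched with simple statistics. This is an illustrative heuristic, not a description of how Fraction AI or any GAP platform actually detects abuse; the function names, thresholds, and sample data are all assumptions.

```python
from collections import Counter
from statistics import median


def flag_heavy_submitters(submissions, multiplier=5.0):
    """Flag users whose submission count far exceeds the median user's count."""
    counts = Counter(user for user, _ in submissions)
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(user for user, n in counts.items() if n > multiplier * typical)


def flag_repetitive(submissions, max_repeat_share=0.5, min_submissions=5):
    """Flag users whose submissions are mostly repeats of the same text."""
    by_user = {}
    for user, text in submissions:
        by_user.setdefault(user, []).append(text.strip().lower())
    flagged = []
    for user, texts in by_user.items():
        # Share of a user's submissions that duplicate another of their own.
        repeat_share = 1 - len(set(texts)) / len(texts)
        if len(texts) >= min_submissions and repeat_share > max_repeat_share:
            flagged.append(user)
    return sorted(flagged)


# Hypothetical (user, text) pairs.
submissions = [
    ("alice", "q1"), ("alice", "q2"),
    ("bob", "q3"),
    ("carol", "q4"), ("carol", "q5"),
] + [("spammer", "same question")] * 40

heavy = flag_heavy_submitters(submissions)
repetitive = flag_repetitive(submissions)
```

The median is used instead of the mean so that a single prolific spammer does not raise the baseline enough to hide their own activity. Flagged accounts would go to review, not be blocked automatically.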
The GAP framework could assign reputation scores to participants based on their contribution history. Ideally, new users would have limited influence until they establish credibility to reduce the risk of initial exploitation.
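One way such a reputation score might work is sketched below. The formula is an assumption chosen for illustration, not something specified by the GAP framework: it divides accepted contributions by total contributions plus a fixed prior, so a newcomer starts near zero influence and gains weight only as a track record accumulates.

```python
def influence_weight(accepted, rejected, prior=10):
    """Return a weight in [0, 1) that grows with a history of accepted work.

    Hypothetical formula: accepted / (accepted + rejected + prior).
    The prior acts like a block of unproven contributions that every
    new user must outweigh before gaining influence.
    """
    if accepted < 0 or rejected < 0 or prior <= 0:
        raise ValueError("counts must be non-negative and prior must be positive")
    return accepted / (accepted + rejected + prior)


new_user = influence_weight(0, 0)       # no history yet
early_user = influence_weight(2, 0)     # two accepted contributions
veteran = influence_weight(90, 10)      # long, mostly accepted history
```

With these inputs the weights rise from the newcomer to the early user to the veteran, so two accepted contributions are not enough to match an established participant.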
Finally, some users will flag issues at random. Platforms leveraging GAP will need human experts or AI to review flags so that accurate and valuable data is not wrongly marked as flawed.
Taking data quality mainstream
Risks aside, humans will be encouraged to spot mislabeled or irrelevant data in AI datasets, improving the quality of machine learning and AI models. Beyond AI, gamified contributions can enhance the accuracy and completeness of free, publicly accessible datasets like Wikipedia or OpenStreetMap. Flagging misinformation in real time will lead to more reliable repositories.
GAP could also help address harmful, biased, or inappropriate content. Platforms like Reddit or YouTube could adopt it to identify and remove such content faster.
Disclaimer: This article is provided for informational purposes only. It is not offered or intended to be used as legal, tax, investment, financial, or other advice.