
These 7 AI Tools Are Secretly Stealing Your Data (And 3 That Actually Protect You)

Most popular AI tools quietly collect your personal data. Here are the ones to avoid and the privacy-first alternatives that make for a safer digital life.


We use Artificial Intelligence (AI) tools almost every day, from writing emails and generating content to answering everyday questions. Their use cuts across diverse areas, and they promise unparalleled convenience, efficiency, and productivity.

However, the convenience of these generative AI tools, coupled with unmonitored adoption, often comes at a hidden cost: your data privacy.

AI tools require massive amounts of data to function. Sadly, this data is often harvested and used without clear, transparent consent from users, and there are few standardized data privacy laws or controls in place to manage the risks this entails.

Research reveals that about 84% of analyzed AI web tools have experienced at least one data breach. This figure highlights the immense risk present across the Artificial Intelligence landscape, particularly in “free” productivity-focused tools.

Unregulated AI data harvesting may lead to an “AI data privacy crisis”: the escalating, systemic risk to personal and proprietary information posed by the rapid adoption of AI tools.

This post will expose the 7 AI tools that are secretly stealing your data, focusing on their risk categories. It will also highlight how to spot them and how to protect yourself from them.

Hidden Cost of “Free” AI Tools

When it comes to “free” AI tools, the associated cost is rarely zero. The truth is, most of these free but powerful AI tools operate on a trade-off: you pay with your data and your privacy.

This harvested data can then be used to train other models “freely,” posing various risks to individuals and businesses. With the most popular, public-facing, free AI tools, you are the product, not the customer.

Generative AI tools are built to absorb every input they receive. In fact, cloud-hosted large language models (LLMs) often retain uploaded content for long, sometimes indefinite, periods.

Unknown to many, your chat logs, text prompts, voice recordings, file uploads, system and usage data, and feedback are the fuel that keeps these AI tools running.

Sadly, this data is often used and stored without users’ clear or transparent consent, and there is little transparency around what happens to it after it is collected, especially with major AI providers like OpenAI, Google, and Meta.

For instance, OpenAI trains on data from both free and paid ChatGPT users by default unless you manually opt out. OpenAI has also stated that deleted conversations are still kept for up to 30 days for safety purposes, even in “Temporary Chat” mode. In other words, even “deleted” chats are not gone right away.

Meta’s policy allows it to use publicly shared content, posts, comments, and images from its platforms for training generative AI models.

For Google’s Gemini, it is difficult for users to ensure the complete deletion of uploaded content. Users’ chat history and uploads are set to auto-delete after 18 months by default, though users can manually shorten this to 3 months. Also, users’ chats reviewed by human teams for quality, safety, or policy violations are stored for up to 3 years.

7 Offenders: AI Tools That Secretly Harvest User Data

Here are 7 AI tools that many of us use daily but that secretly harvest and analyze personal and professional data.

ChatGPT (OpenAI)

By default, all your inputs on ChatGPT (chat logs, text prompts, voice recordings, etc.) are harvested, analyzed, and used to enhance the underlying AI models.

Even with chat history turned off, a 30-day data retention period is maintained for safety review, meaning your prompts are not instantly deleted.

OpenAI's CEO has explicitly warned users that conversations lack legal confidentiality.

Google Gemini (Google)

Google Gemini is designed to connect to your Workspace data. This means that its responses are informed by content and metadata from your Gmail, Docs, and Drive.

Although Google claims that this content is not used for external model training, the seamless access to your sensitive personal data is the primary privacy risk.

Also, by default, users’ activities are set to auto-delete after 18 months, but users can change this to 3 or 36 months, or turn it off entirely. However, conversations reviewed by human annotators are stored for up to three years.

Therefore, users should not enter information into Gemini that they would not want a human reviewer to see.

Meta AI (Instagram/Facebook)

Meta explicitly states in its policy that it uses users’ conversations with Meta AI for ad targeting and content personalization.

This practice integrates all your interactions and chat history with Meta AI directly into Meta's ad services.

GitHub Copilot AI

GitHub Copilot was trained on vast amounts of code scraped from public GitHub repositories, which has sparked controversy. Worse still, the scraping was often done without regard for the original open-source licenses.

Although GitHub now offers controls to prevent the use of new user code for training, its foundation is built on potentially uncompensated developer intellectual property.

Grammarly

To provide real-time suggestions, Grammarly sends almost all typed and uploaded content to its servers for analysis.

Though its policy prohibits selling your data, the transfer of sensitive or high-risk text to a third-party cloud platform is a security vulnerability for many users.

Perplexity AI

As a search-focused chatbot, Perplexity logs and retains every query, IP address, and other usage data by default. Perplexity claims to use this to improve its AI and services.

However, users can manually opt out of having their uploads farmed for training purposes in the account settings.

Replika

This Artificial Intelligence companion is specifically designed to store and analyze vast amounts of deeply personal, emotional, and psychological chat logs to build a highly realistic personality.

While the data is often promised to be private, the indefinite storage of users’ highly sensitive confessions poses a long-term risk of breaches or shifts in the company's data use policy.

How Hackers Train AI on Stolen Data

Hackers are leveraging stolen and leaked data to secretly train sophisticated AI models to mimic realistic, high-fidelity human communication and system patterns. These operations are sometimes referred to as underground or “quiet” AI farms.

This practice weaponizes private information, creating powerful tools for new types of cybercrime, and highlights that the problem of AI content misuse extends far beyond just major corporations collecting user data.

Examples include:

  • Therapy Logs: Hackers harvest highly sensitive emotional context, detailed personal struggles, and psychological vulnerabilities to train AI. An AI trained on this data can generate hyper-personalized deepfake voice messages or emails that exploit known emotional triggers for targeted scams and extortion.
  • Slack Exports: This involves leveraging internal corporate language, organizational hierarchy, project names, technical jargon, and private, informal communication styles. An AI trained on these exports can craft perfectly tailored internal spear-phishing emails that mimic a colleague's, project manager's, or executive's tone and context.
  • Discord Dumps: Artificial Intelligence is trained on informal, emotionally charged, or highly specific community slang, usernames, and direct message content to automate social engineering attacks.

Privacy Laws Aren’t Keeping Up

Privacy laws, such as the GDPR in Europe and the CCPA in California, were enacted to address the collection, use, and security of personal data in traditional contexts.

However, these laws struggle to keep up with rapid technological advancements, the complexity of content collection, and the global nature of the internet.

Even regulators don’t fully understand data provenance and struggle to regulate the scale and complexity of data processing inherent in modern AI systems.

Data provenance, the record of where a piece of data originated and how it was processed, is critical for AI accountability.

However, regulators often lack the deep technical expertise needed to audit the vast, proprietary, and complex training datasets of leading AI models. Also, AI model training data can come from diverse sources, creating an opaque supply chain.

3 Privacy-First AI Tools That Actually Protect You

Not all AI tools secretly steal your information. Some AI tools prioritize user privacy by offering open-source transparency, on-device processing, or end-to-end encryption.

These features make them better alternatives to popular AI services that may “steal” users’ personal data for training purposes. Examples of these privacy-first AI tools include:

  • Ollama: Ollama is an open-source, local-first project that lets you run large language models (LLMs) like Mistral or Llama 3 directly on your own server or laptop. Since inference happens entirely on your hardware, your data never leaves it (see the first sketch after this list).
  • PrivateGPT: Also an open-source, local-first project, PrivateGPT enables you to interact with your documents privately using an LLM. It uses a technique called Retrieval-Augmented Generation (RAG) to keep all your sensitive data completely private and offline (see the second sketch after this list).
  • Claude (Anthropic): Unlike the two projects above, Claude is a cloud-based service. However, it ensures that the data of Commercial/Enterprise users is not used to train its models by default.

However, Free and Pro account users must opt out of model training manually in their account settings.
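To make the local-first idea concrete, below is a minimal sketch that sends a prompt to a locally running Ollama server through its HTTP API on localhost. It assumes you have already installed Ollama and pulled a model (the model name "llama3" and the example prompt are placeholders), and it uses the Python requests library.

```python
import requests

# Ollama exposes a local HTTP API; nothing in this request leaves your machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_local_llm(prompt: str, model: str = "llama3") -> str:
    """Send a prompt to a locally running Ollama model and return its reply."""
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]

if __name__ == "__main__":
    # The prompt is processed entirely by the local model.
    print(ask_local_llm("Summarize the main privacy risks of cloud-based AI chatbots."))
```

Because the request goes to localhost, no prompt or document content is transmitted to a third-party provider.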
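To illustrate the RAG pattern that PrivateGPT builds on, here is a simplified, self-contained sketch (not PrivateGPT's actual code): documents are embedded locally with sentence-transformers, the most relevant one is retrieved by cosine similarity, and it is passed as context to the same local Ollama endpoint. The sample documents, model names, and question are assumptions for illustration only.

```python
import requests
from sentence_transformers import SentenceTransformer, util

# Local embedding model; it downloads once and then runs on your own hardware.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy "document store" standing in for your private files.
documents = [
    "Our Q3 budget allocates $40,000 to security audits.",
    "The onboarding guide explains how to request VPN access.",
    "Customer refunds must be approved by a team lead.",
]
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def answer_privately(question: str, model: str = "llama3") -> str:
    """Retrieve the most relevant local document and ask a local LLM about it."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    best = int(util.cos_sim(q_emb, doc_embeddings).argmax())  # retrieval step
    prompt = (
        f"Answer the question using only this context:\n{documents[best]}\n\n"
        f"Question: {question}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]  # generation step, still on localhost

print(answer_privately("How much are we spending on security audits?"))
```

The retrieval and generation steps both run locally, which is the core of why the RAG approach keeps sensitive documents off third-party servers.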

These privacy auditing tools can help you understand what information is being collected by AI tools:

  • Exodus Privacy: Most mobile AI tools contain third-party trackers for analytics, advertising, etc. Exodus helps identify and expose these embedded trackers and the permissions they request.
  • PrivacySpy: This tool provides a quick, community-led assessment of the privacy policies of various online AI tools for transparency and accountability.

Red Flags Checklist: How to Spot a Data-Hungry AI Tool

Here is a red flags checklist to help you spot a potentially data-hungry or high-risk AI tool:

  • Vague or missing privacy policy.
  • Data pooling across users, products, or third parties.
  • No clear deletion path.
  • Use of your private, proprietary, or sensitive content for model training or for third-party systems.
  • Weak security certifications or compliance.
  • No accessible training opt-out option.
  • No public changelog.
  • Third-party cookie sharing.
  • No transparent encryption policy.
  • AI washing: vague claims with no technical details about the underlying technology.
  • No human oversight or validation.
  • Pressure to onboard immediately.

When assessing AI tools, if you notice any of the red flags above, avoid the tool or look for a privacy-first alternative.

Conclusion

AI tools are powerful, unparalleled engines for innovation, promising convenience, efficiency, and productivity. However, many free AI tools save your data by default and use it to train and improve AI models, often without clear consent.

These actions potentially expose sensitive information and outpace current data protection regulations and privacy laws.

As individuals, our best defense is informed consent, minimal data sharing, and proactive security measures.

We recommend that you rethink your privacy and take proactive steps, such as being mindful of your inputs (uploads), reviewing privacy policies, opting out of model training where possible, limiting public data, and using privacy auditing tools to understand how your uploads are being used.

By understanding how your data is used, you can better navigate the world of generative AI and safeguard your privacy in the age of intelligent automation.

How AI is changing our everyday routines is covered in this related post: “AI in Daily Life: How Artificial Intelligence Is Changing Everyday Routines.”

Frequently Asked Questions (FAQs)

Where is the data stored?

Depending on their size and function, Artificial Intelligence tools can store data in various locations, including cloud platforms, on-premises data centers, and edge storage. Most “free” AI tools store data on cloud platforms.

Is user content used for retraining?

Yes, many free Artificial Intelligence tools save users’ data by default and use it for retraining and improving AI models, often without clear consent.

Can data be deleted on request?

Yes, data can be deleted on request from AI tools, but the timing and effectiveness of the process can vary significantly, especially with advanced generative AI. For instance, OpenAI stores “deleted data” for up to 30 days for safety purposes. Google’s Gemini, by default, stores users’ data for 18 months before deletion, and up to 3 years if the data is reviewed by human teams for quality, safety, or policy violations.

Who owns generated content?

The ownership of AI-generated content is complex and not definitively clear. However, the user who uploaded the prompts is often considered the owner. Nevertheless, this is subject to the AI tool’s terms of service.
