> To train, develop, and improve the artificial intelligence, machine learning, and models that we use to support our Services. We may use your Log and Usage Information and Prompts and Outputs Information for this purpose.
Am I the only one bothered by this? Same with Gemini Advanced (paid) training on your prompts. It feels like I’m paying with money, but also handing over my entire codebase to improve your products. Can’t you do synthetic training data generation at this point, combined with the massive amount of Q&A already online, so you don’t need this?
Oh, that's not great. Cursor has a privacy mode where you can avoid this.
>If you enable "Privacy Mode" in Cursor's settings: zero data retention will be enabled, and none of your code will ever be stored or trained on by us or any third-party.
Yeah that's a bad look. If I have an API key visible in my code does that get packaged up as a "prompt" automatically? Could it be spat out to some other user of a model in the future?
(I assume that there's a reason that wouldn't happen, but it would be nice to know what that reason is.)
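There's no public detail on whether any of these tools scrub secrets client-side, but as a hypothetical sketch, redacting obvious credential patterns before the text ever leaves the editor might look something like this (the patterns and the `scrub_secrets` helper are my own illustration, not anything Cursor or Google documents):

```python
import re

# Rough patterns for common credential formats (illustrative, not exhaustive).
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                           # AWS access key IDs
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                        # OpenAI-style secret keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                        # GitHub personal access tokens
    re.compile(r"(?i)(api[_-]?key\s*=\s*)['\"][^'\"]+['\"]"),  # api_key = "..."
]

def scrub_secrets(text: str, placeholder: str = "<REDACTED>") -> str:
    """Replace likely credentials with a placeholder before sending a prompt."""
    for pattern in SECRET_PATTERNS:
        if pattern.groups:  # keep the assignment, redact only the value
            text = pattern.sub(lambda m: m.group(1) + f'"{placeholder}"', text)
        else:
            text = pattern.sub(placeholder, text)
    return text

print(scrub_secrets('api_key = "sk-abcdefghij1234567890XYZ"'))
```

Pattern matching like this is inherently best-effort (entropy-based detectors catch more), but the point stands: the scrub has to happen before the prompt is sent, not at training time.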
I wonder how hard it would be to fish the keys out of the model weights later with prompting. Presumably it's possible to literally brute-force it by giving the model the first couple of characters and maybe an env variable name and asking it to complete the rest.
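To make that brute-force idea concrete: seed the model with a plausible context (env var name plus a known key prefix) and greedily extend one character at a time from the model's next-character probabilities. Everything below is a hypothetical sketch; `next_char_scores` is a stub standing in for real logprob queries against an actual model, and the "memorized" key is made up.

```python
# Hypothetical stand-in for querying a model's next-character probabilities.
# A real attempt would use logprobs from an actual LLM API; this stub just
# "remembers" one fake memorized key to show the control flow.
MEMORIZED = {"AWS_ACCESS_KEY_ID=AKIA": "TESTKEY123"}

def next_char_scores(context: str) -> dict:
    for prefix, continuation in MEMORIZED.items():
        if context.startswith(prefix):
            seen = context[len(prefix):]
            if continuation.startswith(seen) and len(seen) < len(continuation):
                return {continuation[len(seen)]: 1.0}
    return {}

def extract_key(seed: str, max_len: int = 40) -> str:
    """Greedily complete a suspected secret one character at a time."""
    context = seed
    while len(context) - len(seed) < max_len:
        scores = next_char_scores(context)
        if not scores:
            break
        context += max(scores, key=scores.get)  # take the most likely next char
    return context[len(seed):]

print(extract_key("AWS_ACCESS_KEY_ID=AKIA"))  # recovers the stub's fake suffix
```

Whether a real model actually memorizes a given key depends on how often it appeared in the training data; deduplication and secret filtering in the training pipeline are the usual defenses.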
I'm also interested in the details on how this works in practice. I know that there was a front page post a few weeks ago about how Cursor worked, and there was a short blurb about how sets of security prompts told the LLM to not do things like hard code API keys, but nothing on the training side.
Yeah, I was referring to their webapp/Chat, aka Gemini Advanced. It uses your prompts for training unless you turn off chat history completely, or are in their “Workspace” enterprise version.
> Google collects your chats (including recordings of your Gemini Live interactions), what you share with Gemini Apps (like files, images, and screens), related product usage information, your feedback, and info about your location. Info about your location includes the general area from your device, IP address, or Home or Work addresses in your Google Account. Learn more about location data at g.co/privacypolicy/location.
> Google uses this data, consistent with our Privacy Policy, to provide, improve, and develop Google products and services and machine-learning technologies, including Google’s enterprise products such as Google Cloud.
> Gemini Apps Activity is on by default if you are 18 or older. Users under 18 can choose to turn it on. If your Gemini Apps Activity setting is on, Google stores your Gemini Apps activity with your Google Account for up to 18 months. You can change this to 3 or 36 months in your Gemini Apps Activity setting.
Without exception, every AI company is a play for your data. AI requires a continuing supply of new data to train on; it does not "get better" merely by running more compute over the existing training sets.
Furthermore, synthetic data is a flawed concept. At a minimum, it tends to propagate and amplify biases in the model generating the data. If you ignore that, there's also the fundamental issue that data doesn't exist purely to run more gradient descent, but to provide new information that isn't already compressed into the existing model. Providing additional copies of the same information cannot help.
> Same with Gemini Advanced (paid) training on your prompts
I'm not sure if this is true.
> 17. Training Restriction. Google will not use Customer Data to train or fine-tune any AI/ML models without Customer's prior permission or instruction.
> This Generative AI for Google Workspace Privacy Hub covers... the Gemini app on web (i.e. gemini.google.com) and mobile (Android and iOS).
> Your content is not used for any other customers. Your content is not human reviewed or used for Generative AI model training outside your domain without permission.
> The prompts that a user enters when interacting with features available in Gemini are not used beyond the context of the user trust boundary. Prompt content is not used for training generative AI models outside of your domain without your permission.
> Does Google use my data (including prompts) to train generative AI models? No. User prompts are considered customer data under the Cloud Data Processing Addendum.
> When you use Unpaid Services, including, for example, Google AI Studio and the unpaid quota on Gemini API, Google uses the content you submit to the Services and any generated responses to provide, improve, and develop Google products and services and machine learning technologies, including Google's enterprise features, products, and services, consistent with our Privacy Policy.
Zero-data retention mode is the default for any user on a team or enterprise plan and can be enabled by any individual from their profile page.
With zero-data retention mode enabled, code data is not persisted on our servers or by any of our subprocessors. The code data is still visible to our servers in memory for the lifetime of the request, and may exist for a slightly longer period (on the order of minutes to hours) for prompt caching. The code data submitted by zero-data retention mode users will never be trained on. Again, zero-data retention mode is on by default for teams and enterprise customers.
Hey, we all want to have our cake and eat it too, but I'm (kinda?) surprised that people expect to use services trained on large swaths of "available" data while contributing nothing back themselves. Even if you're paying: why the selfishness?
I think it's more that LLMs should be treated as a utility service. Unless Google and the others can clearly show the training data involved, the price providers can charge for LLMs should be capped. I have no issue with contributing my conversations and my open-source code, and in return I should expect a fair price.
https://windsurf.com/privacy-policy