The primary case here made against Google might be questionable behavior from Google, but might not be illegal, since simple numbers like a dollar amount can’t be copyrighted, and AFAIK there is no other legal impediment to Google doing what they did (IANAL).
What looks far more inculpatory to me is this, on the last page:
[…] Because it controls essentially the entire
internet, Google has endless levers at its disposal to significantly harm or snuff out a rival.
I don’t think it’s a coincidence that our organic rankings have continued to suffer since I’ve
become a vocal critic of their practices. Earlier this year, within weeks of the publication of
a Wired magazine article that included quotes from me and a recap of our story, the CNW
mobile app was banned from the Google Play store without explanation or recourse. As a
result we were also banned from using the Google’s dominant Ad Mob mobile ad platform.
This is far more damning of Google abusing its monopoly, not only on search, but also on the Android platform.
> And what about those conjured celebrities I added as a precaution? All five were all scraped right into Google’s search result pages. It provided undeniable proof that after being turned down, Google simply went ahead and stole the entire database of content CNW took eight years and over a million dollars to build.
A number might not be copyrightable, but what about a news article? What would happen if all the world’s new sites just denied the Google crawler using robots.txt? What if they also returned 404s if accessed from an IP address known to be owned by Google?
And when Teddyh says "simple numbers like a dollar amount can’t be copyrighted" he's ignoring the fact that google isn't just copying dollar amounts. Google's copying a data set which establishes a dollar amount as a net worth of a particular person. That's a database.
The only difficulty for CNW will be coming up with the money, time, and emotional energy to achieve a result in court over as many years as it takes Google to exhaust all legal possibilities, of which there are as many as the defendant has dollars.
The problem is, we don't have a justice system, we have a legal system.
The mere fact that he compiled the database isn't sufficient to qualify for copyright protection. He'll have to show that he used a sufficient level of creativity to arrive at the numbers he displayed.
Based on his testimony, I suspect he'd have a reasonable case to argue, but he'd have to show that his researchers carefully weighed different sources and made other judgements for each individual celebrity rather than just running some script over public property records.
The article says they employed a number of researchers to come up with the numbers.
I don't think property records (or any other publicly available information) would be enough to come up with an accurate figure in an automated fashion. So, there's some educated guesses (aka "informed estimations") going on and therefore a form of "creativity" (IMHO).
I agree if the testimony is taken at face value, it makes a pretty reasonable case. However, there's nothing to say that the values he arrived at were accurate (not that it'd necessarily be material to the case) or that his 'researchers' were not just some content farm to write up articles around those numbers in order to enhance SEO. These are the kind of things that'd be deliberated upon in a court case.
I'm not sure about that; the mass of the moon to more decimal places would surely require actual research potentially at great expense, but the resulting number shouldn't be copyrightable since the resulting number is a fact about nature.
If each entry was individually researched by a person that made judgements by weighing various sources of information against each other to arrive at an estimate, then yes I think they'd have a pretty strong legal argument.
I would argue that even the appearance of taking someone else's work should be a damning blow to Google internally. Google prides itself of having some of the world's best academic researchers under their roof, and plagiarism in an academic context is usually a career-ending affair.
I finally understand what people meant when they would say "for some people, the Internet is Google". Google started off as a search engine, and many people were able to search for answers to their questions such that Google's search engine became the de facto platform for question-answering for this group of people. Over time, Google developed capabilities or acquired capabilities in answering questions in growing number of niches e.g. scraping websites for snippets like in this article or acquiring companies that had developed this sort of capabilities like ITA for flights [0]. As a result, the Google platform that is search started to vertically integrate these different niches bit by bit starving out incumbent competitors via prioritization of search results.
I can can think of one analogy that would be more obviously problematic and troubling: imagine a library system spread across a large geographic area with many branches that was wealthy enough to buy their own book publisher. Maybe it starts out with non-fiction and the quality of the published books remains high. What if the library system then started to prefer this associated publisher? At one point, the library network might end up saying that the other non-fiction books published in the same time period are not being lent out enough to be worth stocking, and so removes / puts them away in long-term storage. One day this library system buys a new publisher, but this time it is fiction, but maybe just one genre like adventure or children's books. Then, the same pattern of behavior emerges.
The library system / network I described above sounds like a very useful and successful platform: almost like a public utility.
Another example might be a telegraph / telephone company anytime before ~1950 buying national newspapers.
Nobody forces you to (checks notes) create celebrity net worth databases.
I just feel like we live in a time with more information than ever, the problem clearly being inaccurate information or propaganda. The problem isn’t insufficient marginal knowledge. Obviously we live in this high knowledge world because of, not despite, Google, so it’s hard to see if you’d really be on the right side of justice if you’re against “big knowledge”
> Nobody forces you to (checks notes) create celebrity net worth databases.
So? What does it have to do with having rights over it? You can say that with literally anything.
You go on afterward arguing about inaccurate information, which has nothing to do with the current subject, and still fail to see why it would be important to protect revenue source over information that was curated...
How is Google "Big Knowledge" when the actively play a role in deciding what knowledge gets disseminated or buried? You said yourself the problem is inaccurate information or propaganda which Google is a part of if they're suppressing information they don't like.
And these numbers aren't "facts". They're the result of research based on estimates from information about movie deals, real estate investments, business deals, combined with some methodology for producing an estimate based on that. This is particular creative expression of an idea, not just a compilation of facts.
Taking someone else's copy written content and presenting it as your own can be seen as content theft. Imagine if a book author did that to a chapter of a book.
When the one doing it can essential present the case that they should be allowed to do it or they will destroy your business then it's an abuse of power. When they are in a position to actually do it because they have so much of the power in the space it can be argued to be a monopoly and abuse of monopoly power.
Google isn't just copying a number. They are copying text from other sites to present.
It's much less clear in this case. What about displaying the title of the website, is that ok? Just about everyone would say yes. What about the first few sentences of the page? Ok, I think we are good here, probably ok. What about an intelligent snippet of the website that highlights your search terms? Well, ok we are starting to get a bit gray, because now for some searches we don't need to go to the website anymore. But is it wrong or illegal? Well what about formatting that in a nice way that prominently displays the answer to your query? Well now we have an issue, but how big of a leap was it to get there? But is it wrong because they targeted the website and built parsers specifically for it, or because it is done at all? I don't know. But I think it was a slippery slope that got them there.
(You should read up a bit on this; it’s obvious you have not since you call it “copy written” when the term is “copyright” and a work can therefore be “copyrighted”. The term is about who has the right to copy a given work, and has nothing to do with “writing”.)
Copying text someone else has written is indeed copyright infringement (commonly, but technically inaccurately, called “theft”). Except for the narrowly defined classes of copying called “fair use”, Google would indeed be breaking the law if they did that, but:
> Google isn't just copying a number. They are copying text from other sites to present.
That’s not what the topic was about, and therefore not what my comment was about. My comment concerned the specific case made by the PDF document, which was about Google copying individual dollar amounts, not text.
> (You should read up a bit on this; it’s obvious you have not since you call it “copy written” when the term is “copyright” and a work can therefore be “copyrighted”. The term is about who has the right to copy a given work, and has nothing to do with “writing”.)
You used someone’s grammar as a reason to piss all over a reasonable comment, merely because it didn’t 100% validate your own?
You should read up a bit on basic human decency and even confidence. Otherwise, you just look like a know it all who can’t handle learning new things. That’s weak.
Maybe you should realize that merely commenting upon someone’s word choices, inferring inexperience, and providing some elucidation, is not the same as “piss all over”. I mean, nowhere did I say anything else negative about anything in mfer’s comment; I did not even disagree with anything they said!
Ergo, please take your lecturing about “basic human decency” somewhere else.
Maybe, just maybe, you should take the criticism to mean that your intent didn't come across well and that you have an opportunity to improve how you communicate with others.
>(You should read up a bit on this; it’s obvious you have not since you call it “copy written” when the term is “copyright” and a work can therefore be “copyrighted”. The term is about who has the right to copy a given work, and has nothing to do with “writing”.)
Perhaps the parent comment was from some country where things are different, like the language.
Perhaps, but this case was explicitly about US copyright, and may even hinge on the specific intricacies on whether database copyright in US law covers this specific case.
NAL and the only information I have is from a 2 credit intellectual property class I took in college, but facts (like dollar amounts) are not copyrightable, but curated collections of facts (like an almanac) are copyrightable. It seems this would fall under the latter category. Also from the LinkedIn - hiQ Labs lawsuit, it seems that unfair business practices are factored in. Google is implementing unfair business practices that stifle competition by taking the data to make a competing product.
True, but that term means something completely different from what was discussed here, and the sentence does not make sense if you assume that meaning. It’s quite obvious to me that they, like many others, erroneously thought of “copyright” as “copywrite”.
Just because it wasn't illegal (or questionably legal) yesterday doesn't mean we can't make it illegal tomorrow.
I would surmise that the most popular laws ever written by
humanity, collectively, started with a bunch of people thinking some version of "That behavior is questionable."
So sure, the last part is bad too. But that statement was being provide to people who write new laws. Not quite understanding why the current legality of Google's copying is that significant in that context.
The "it's not illegal" game is what every dominant company or powerful mobster wants you to play. Because it's their game and they spend a great deal of money to ensure they always win.
Yes. Therefore, if we want to win, we should not play that game. Which is why I pointed to something else in the document which much more strongly implicated wrongdoing by Google.
In the case of the business in question, they employed people and spent resources to create that database. It's only fair if they get compensated for their work if the numbers that they collected is to be copied.
Well we can hope to surface avenues of monopolistic behavior that allow a company to abuse their position within the confines of the law through committees like this. Just because it follows the letter of the law doesn't mean it's not malicious or damaging.
>>but might not be illegal, since simple numbers like a dollar amount can’t be copyrighted, and AFAIK there is no other legal impediment to Google doing what they did (IANAL).
Forbes for one spends a LOT of money hiring people to try to estimate the net-worth. So it is a number--but one that someone paid a lot of money to get it. Illegal? Maybe. Not-ethical? Very much so IMO
A curated database like this can be copyrighted. The author even mentions this in the article, that the net worth numbers are not considered as commodity information like the height of a famous building.
> This is far more damning of Google abusing its monopoly, not only on search, but also on the Android platform.
No, it’s really not. I just visited celebritynetworth.com. One article took 462 http requests, loaded 9mb of content and though DOM content loaded in 434ms, it took 1 minute to finish loading everything.
The sheer amount of adware on this site is staggering.
Let’s put down the pitchforks and put on our critical thinking caps. Does serving all that adware provide anything close to an acceptable user experience?? And if not, doesn’t it look more like this website merely fell out of search engine fashion as things like performance and UX began to trump other factors?
You’ve totally glossed over the fact that there was an additional slide in the rankings after the site owner went public with this story. And what about the app being banned from the App Store?
Why would someone go through the trouble of compiling extensive research for free? Ads pay for things that people want but are unwilling to actually spend dollars on. That’s the state of things, whether we like it or not. Don’t hate the player, hate the game.
I think that again, we need to put our pitchforks down and think through this critically. Maybe someone from Google did decide that CNW is about to blow the lid on their scam and decide to shit all over its business model. But let’s step back for an instant and think over some other possibilities.
Google released some pretty major core updates in 2018 and 2019. A lot of these updates seemed to focus on what the SEO community has taken to refer to as EAT (expertise, authority and trust). Google’s own guidance on some of these indicate a similar theme - they’re interested in sources, quality of analysis and other trust metrics. Some major sites saw some really massive drops over these years.
When I checked out CNW earlier on today, I read a couple of articles. There were no sources at all. The net worth calculation seems to be the individual’s reported earnings less state and federal taxes. And the content looks like it was lifted from Wikipedia - in some cases, entire sentences were reprinted word for word.
To get that, my browser had to make 426 requests, download 9mb of mostly adware and display ads almost everywhere.
Maybe this is all a big Google ploy but I think it’s more likely that their algorithm has changed to favour high quality sites as opposed to sites that used to get search performance because of valuable keywords in their domain.
As for their Android app, I’ve never owned a phone that runs Android so I can’t find and download an Android version but I decided to go down the rabbit hole and install CNW in my iPhone. They’ve apparently been ad free for two years, which is nicer than their web experience. But the content is the same kind of stuff as their website - entire sentences seem lifted directly from Wikipedia, their news section is similar, it’s just slightly repackaged stories from other sources.
Thanks to today, I now know that Takeshi 6ix9ine made $5m for a livestream which pushed his net worth to a reported $8 million. I’ve never felt so old or out of touch in my life...:)
You prove it is an acceptable user experience. Of course it's easy to scroll past ads.
When I read Encyclopedia Brittanica, there are no sources. Yes, it's a pain in the butt to grab the book off of the shelf, but it'd still be wrong for Google to copy it.
If the data is no better than Wikipedia, why didn't Google copy Wikipedia? With all their PhDs and $100M bonuses, they could not do better. They copied CNW because the CNW dataset was better. The fakes prove it. To deny the CNW's data's worth is illogical, pitchforks or no pitchforks.
That's not the issue OP is addressing. If Google is ranking them lower because it's a slow, crappy, ad-ridden site, that's understandable. It's stealing their data and presenting it as their own that's the problem here.
A site being badly designed does not mean they're suddenly not protected by copyright.
As I mentioned to another commenter, I took a deep down into the CNW rabbit hole and even downloaded their app. I read a few articles and compared them to the individuals’ Wikipedia pages. The content appears to be a mild rewrite of their Wikipedia pages and in many cases, entire sentences are copied verbatim.
As for net worth, near as I can tell, for minor celebrities, they take their reported earnings, subtract relevant taxes and call that their net worth. For major celebrities, they seem to rely on Forbes and/or Business Insider as their sources.
The joke I'm trying to make is that if they had a much nicer site design it still wouldn't be ranked higher. It would only be ranked higher if and only if Amp was used... because of bias.
I get your point but I went down a CNW rabbit hole and I’ve got to tell you, there isn’t a whole lot of original reporting being done. There are no sources and in many cases (6ix9ine is a great example), the content seems to be a rewrite of his Wikipedia page. In other cases, entire sentences are reprinted verbatim from Wikipedia. The net worth calculation is basically reported earnings less relevant taxes. For major celebrities, they seem to rely on publications like Forbes and Business Insider coverage.
Here’s a fun rabbit hole for you to go down. Find CNW’s article on 6ix9ine and compare it to his Wikipedia page. Then take his reported earnings and subtract relevant taxes.
Given enough time tinkering with the prompt, and assuming the Wired article and website were in the training data, I think GPT-3 could probably suss out the relationship between the critical comments in Wired and the net worth website. It clearly can identify sentiment and maybe from that even infer the strained relationship of the site to Google
Seems plausible that Search could autonomously and possibly without human intent begin deranking sites that directly or indirectly surface negative opinions. At some point we could quite literally have to worry about hurting the feelings of our search engines.
Just tried this through AI Dungeon to see what happens:
Prompt
You are Bob Warner, founder of CelebrityNetWorth.com, and are speaking at a hearing before the House Judiciary Subcommittee on Antitrust, Commercial and Administrative Law. The subject is Google. You start your testimony as follows:
"Thank you, Chairman Cicilline, Ranking Member Sensenbrenner, Committee Chair Nadler and Ranking Member Collins,
GPT-3 Response
for inviting me to testify today about Google's anticompetitive conduct in the search market. I am honored that my company, CelebrityNetWorth.com, has been chosen by the subcommittee to speak on this important issue.
I want to begin by thanking the members of this committee for their interest in our industry. As you know, we are an information technology business that provides Internet marketing services to small businesses and individuals. We have been providing these services since 1996. Our service allows users to post advertisements on the Internet in exchange for money. In addition, we provide other services such as web hosting and e-mail services.
CelebrityNetWorth.com is an information website that provides public information about celebrities. It was founded in 2008 and has been very successful. We do not collect user information; we simply provide the public with information about celebrities. For example, on our website, you can find out how much money Kim Kardashian is worth.
We have thousands of users and track over 3 million individuals. We get a combined 6 million monthly page views. We receive an average of 75,000 unique visitors per month. We are very successful and I thank this subcommittee for the help it has given us over the years.
Our website has had many successes, but our crowning achievement was exposing Google for the shakedown racket that it is. Our story began in 2006 when we tried to enter the search engine market. At that time, Google had a near monopoly in the search engine market.
In 2009, we applied to be listed in the Google search engine as a secondary link. We went through all the proper channels and even paid money to Google. After we paid them, our website's popularity increased substantially.
Then I backed off the context quite a bit. (Note I prompt for Google in the middle) Doesn't surface the conflict noted in the PDF but makes an interesting observation about FaceBook.
Prompt
You are Bob Warner, founder of CelebrityNetWorth.com, and are having lunch with an old friend of yours named Phil. You're discussing what the last ten years has been like with the site and some of the major challenges you've faced. The conversation goes as follows:
Phil: Hey Bob, I've been thinking about how much things have changed over the years and how you've had to adapt to keep up with them. What would you say are some of the major challenges you've had keeping CelebrityNetWorth.com up and running and making money?
GPT-3 Response
Bob: Well, it's a combination of several factors. One is that we've always tried to stay on top of all the new sites popping up and trying to get our content out there first. We were also lucky enough to be in business long before most other sites were. So even though we started off small, we've managed to grow into one of the largest sites around. Another factor is that we've always kept our prices low because we didn't want to compete with free sites. Nowadays, many people can get their information for free but they still charge a lot of money for their services. So we offer a good deal for our users and we've found that they keep coming back to us.
We also realized that we had to start diversifying our income streams because advertising alone just isn't enough anymore. There's so much out there that it's actually really hard to keep people's attention, and a lot of people block ads nowadays.
Prompt
Phil: What do you think about Google?
GPT-3 Response
Bob: I've been getting a lot of questions about them. I don't think they're a threat to us because we offer way more information than they do. We're not in competition with them.
However, I'm getting a little worried about Facebook. I see them moving into our turf by creating their own "Celebrity Pages" and getting content from their users.
Google is incredibly shady. There’s been a manual penalty in place against my website The Online Slang Dictionary (http://onlineslangdictionary.com) for the better part of a decade. I know this because the data suggested it - and then a Google employee confirmed it.
It would be easy to explain it away because Aaron Peckham, owner of Urban Dictionary, worked at Google when Matt Cutts was head of the Web Spam team, and they knew each other. But that’s incredibly circumstantial. I do know that when I confronted Cutts about the penalty here on HN he lied about it - which does lead one to wonder about his motive.
> I do know that when I confronted Cutts about the penalty here on HN he lied about it
That's an extreme thing to say and crosses into personal attack. The thread you seem to be referring to (https://news.ycombinator.com/item?id=5418864) doesn't seem to support what you've said here. At a minimum, you're breaking the site guideline which says: "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith." https://news.ycombinator.com/newsguidelines.html
I'm not saying your claims about Google and your site are false—I have no idea; it sounds like you have every right to be frustrated. Also, I don't know Matt. But these sorts of accusations ought not to be slung carelessly on HN. Since we're trying to have a site that staves off internet-default outcomes, we all need to be careful about this.
I have experience with these dynamics in the HN context and can tell you that people jump to conclusions all the time about why they were penalized or banned. Their conclusions are almost always wrong and overly dramatized. Frequently they take to the forum to declaim about how badly they were treated, and they always put it in strikingly specific and factual-sounding terms, as if they know for sure. But they don't know for sure; in fact they don't know at all. They've just completely made it up.
The HN context is simpler and has lower stakes than Google. (At least here, people can get factual answers to specific questions if they ask.) But that only strengthens the point: these domains are complicated and the dramatic explanations that people come up with out of frustration are almost always wrong. Of course it's frustrating is Google has penalized your site, but please don't let that boil over into accusations against specific people unless you can strictly demonstrate what you're saying.
please don't let that boil over into accusations against specific people unless you can strictly demonstrate what you're saying.
I realized that I’m using stronger verbiage here than in the article on my website. I haven’t updated the article in some time. If I recall correctly, it doesn’t contain any evidence/proof that explicitly contradicts specifically what he told me in the linked HN thread. I will update the article, and keep in mind the contents of your comment. Out of legitimate curiosity, at what level in the org chart is it okay to say that someone lied? We can say that the CEO lied. Matt was a senior employee who was also very public: he was the face of Google’s ranking algorithms.
I don't really think in terms like "at what level in the org chart is it okay", but if pressed I guess I'd have to say that it's never ok to say that someone lied when you don't have evidence of that.
Keep in mind that the word "lie" is strong medicine. It implies not only that someone said something untrue but that they deliberately intended to deceive.
I think it's fair to say HN policies around naming and shaming are pretty cut-and-dry in the comment section: that is, unless it's the subject of the article or link by the OP, it's not okay.
So from what I can tell, if someone writes an article about an employee of a company, people can say whatever they like in the comment section - anecdotes as colleagues, stories of similar environments, a mix of facts and claims. The article in question may be mostly heresy, and we've seen in the past this happen with the likes of Bezos, or very public people within our field like Stallman, or founders with questionable morals (see the latest AI controversies, or ads, or hacking, or regulations)..
But as soon as we specifically name someone in the comments, it's considered a witch-hunt. I think this is only true the first time it happens, or prior to them being targeted outside of this community.
Once someone writes a blog post with at least some facts on the subject, or the mainstream media writes about them - suddenly they're fair-game.
dang, we know this isn't going to get any easier to moderate. And I think writing broad, sweeping rules around this won't work. However, could you possibly write a post about it? You've expanded in the comments section in response to my own failure to live up to these standards (sorry again!), which I think was a good amount of information.
It's my intention to collect the moderation comments on various of these perennial topics, cherry-pick the best bits, and turn them into either some essays or perhaps a kind of super-FAQ. That's why I link to old moderation comments using hn.algolia.com; the answers have started to converge in the last few years, and maybe there's a blissful future in which we can simply link to official explanations rather than endlessly reworking them. So yes in principle, though when there will be time for it is another question.
Do you have any evidence at all for these extraordinary claims? Why would Urban Dictionary go to such lengths to suppress a competitor which posed no threat[0]?
I also notice that the homepage hasn't been updated at all for at least 8 years[1]. I suspect this may have had something to do with its decline.
I have to say, after reading those posts, Matt is looking more trustworthy than the OP.
Especially considering the definitely-not-a-sock-puppet post by jimboykin [1], an account that was created immediately after this thread and has but a single post on HN.
I mean part of this issue is that there is no feedback to the website owner, which is shown by the guy only getting a response here on HN.
Like isn't it kind of insane that the top Google search engineer provided customer service to OP, but he couldn't get a response thru a normal channel.
I have no idea what that business is with that company.
I did everything in his comment to address the claimed automated penalties: removed every single ad and every single citation from the site. No changes in ranking at all, after months. I forget how many. 6 months?
That’s because Matt Cutts was lying: there was a manual penalty in place at that time.
I have evidence - it’s in the link. I also have proof, but I’m not comfortable with publishing my communication with the Google employee right now.
I have no idea why the penalty is in place. What I describe is “incredibly circumstantial.”
You say that my site poses no threat. Is that because it hasn’t seen traffic growth or become popular? The site is no threat exactly because the penalty is in place and has been for almost a decade. That’s a long time for a website. Traffic growth was very healthy before all of this.
The home page has been updated more recently than that, but you’re right that it still has the same page structure. With the manual penalty in place, it doesn’t matter how often I update the skin of the site or the home page text: the site is dead.
Just to save everyone the effort, the closest thing to evidence amid that blog post's many wild allegations is that the pandas update changes around citations hurt the site's google flow. This isn't surprising. That hit a lot of low quality websites that were essentially arbitraging google to aggregate content duplicated elsewhere on the web. I worked for a similar business with considerably more traffic. We didn't interpret it as some directed conspiracy against us. We knew from day 1 that's a risk of developing such a site, and why original content still matters a great deal. We adapted and moved on.
The remaining proof claimed is that a whistleblower told him Matt Cutts implemented a penalty manually. But while our poster is quite willing to share the allegation, they're not willing to share the communication.
Dude - I went to your side. I looked up a random word - it had a 3 word definition with 12 seconds to interaction, at least a second of blocking content, a pageload time of 25 seconds (!! for three words!), 352 requests! 1.5MB page size, and content farm crap filling out the rest of the page.
I mean, I'm a user of the web, and I hated sites like this.
You score terribly among all major search providers I tried BTW. Your page-speed score was an F.
You have something like 50 redirect chains on a page with three words!!
Some of these chains are 20 items long (ie, 20 redirects). I mean - how does this even happen?
Quick heads up - This is going to rank terribly everywhere.
Yes, google does "penalize" content farms, and they did a big redo a while back to really try to push this type of 99% repeating content off their index. I've no doubt they hit some good sites, yours doesn't look like one at first glance, but even if I just go unlucky there were tons of sites (review sites / how to sites) that just had endless repeating content with very little originality.
How do folks at google even have time to deal with sites like this and these types of complaints "I confronted Matt" etc...
Very true. Due to the penalty, the site gets less and less traffic every year. Enabling SSL/TLS might help a bit, but I have to face facts: the site is dead.
Not to belittle your experience, but it seems this may be a bit of a negative feedback loop. Haven't updated the homepage because "what's the point I'm already punished" etc...
I spent 80 hours a week from when the penalties started in 2011 until I went broke and got outside employment in 2014. I’ve worked on it for years worth of time since then, including updating the home page. (It’s a content site. Updating the home page is a thing, but people come to the site via searches for slang terms, not for the home page...)
There is nothing I can do with the manual penalty in place, and it doesn’t make sense to throw good time after bad. It’s been 9 years since the penalties started and there’s been no improvement since then. Not only is it not healthy to continue banging my head against the wall of Google, since we’re on a site for a startup incubator: it doesn’t make monetary sense.
I suspect he was the Googleguy from the Webmasterworld.com forums around 2002-2003, when Google (the company) was telling website owners: "trust us, we only need your website data in order to index it and to make it more easily searchable, no foul play here, we're never going to become a portal like Yahoo, we're never going to become your competitors".
Suffice is to say that things didn't turn out that way, and it has always bugged me that the people directly involved in this deception (i.e. people like Matt Cutts) have never even said "we're sorry". They just took their fat pay-checks and vested their shares and they were done.
Are you saying that the only thing that Google did was to deprioritize sites with a lot of citations? Not that they deprioritized your site specifically? This kind of flag would affect every site they crawl. I feel like it's much more likely that it was made to prevent some class of abusers, and you were caught up in the scope of their net.
I mean, every time they make a change in ranking some people get hurt and some people get helped. If this is a universal change and not tied to your domain I have a hard time believing that a team would've added this flag because a former employee makes a competing site with a slightly different model from you.
Are you saying that the only thing that Google did was to deprioritize sites with a lot of citations?
That was the automated penalty. The rest of your comment is correct. The manual penalty came later.
I have a hard time believing that a team would've added this flag because a former employee makes a competing site with a slightly different model from you.
As I said, that’s incredibly circumstantial. I don’t know the motive, cause, etc. I’m not going to publish my communication with the Google employee who informed me about the manual penalty at this time, so what I’ve written in that article is the extent of the public information at this time.
You appear to have roughly the same ranking in both Bing and Google for two words that I tried ("bro", "womp womp"). Do you believe there is also a conspiracy against your website at Microsoft?
While this is all very shady, it also seems to be a poor long-term business decision for Google.
"Our revenue and profits have dropped accordingly,which has forced me to cut our staff down to a bare bones team ... the quality of our content and product have dropped"
If CNW is no longer able provide that data because of financial insolvency, then Google no longer has a source they can scrape. This smells like the poor short-term incentives that drive individual roles instead of simply unethical behavior on Google's part.
> If the penalty weren't in place, we would get about double the visitors of that hate site. Given the rise of white nationalism in the West, does it make sense that a single person within Google can destroy a dictionary, while hate sites flourish?
Going to need evidence here. Your post comes off as a bit of a one-man pity party without it.
Yeah that text is... long in the tooth. It was written before all of the deplatformings, when Google was sending tons of traffic to such sites.
Aside from that, I tried to be as neutral as possible. But the site is my life’s work so far - it’s hard to leave emotion out of it.
At this time I’m not going to publish my communication with the Google employee who informed me about the penalty, so if you think it’s a pity party, I’m okay with that. I’d rather that be the case than risk the livelihood of someone who did the right thing.
> I do know that when I confronted Cutts about the penalty here on HN he lied about it - which does lead one to wonder about his motive
The preamble to this post hints of paranoia, but this is where we know for sure. There's no reason to believe Cutts would have any personal knowledge or involvement or motivation, except that he knew another person who might have had some weak interest in maybe doing something. Of course, the poster feels that, having drawn this chain of connections, that he must be right and Cutts must be lying.
The poster feels entitled now to personally attacking a specific well-known search engineer, one who has been active here in the past. This seems like a conduct violation on here or any reasonable discussion site.
I have 20+ years of SEO experience, going back before it was commonly called SEO, before Google even existed. I cut my SEO teeth figuring out how to manipulate AltaVista rankings.
Here, I've marked up a typical content page of yours, showing what's wrong very blatantly. It took about 30 seconds to diagnose the whole site based on the repeating content farm structures and thin content problems.
That page maybe has 20 or 30 words of quality content on it. That will get you an extreme demotion from Google if you're running a whole site like that in a content farm structure.
Your site was very likely penalized due to low quality, shallow content. Google tagged it as being a content farm. No conspiracy is necessary, it's obvious why it's penalized. Any SEO expert would point out the countless problems in the first few minutes of a discussion with them.
I looked at numerous pages, all were of very low quality, and thin / shallow on content.
On one page I checked there were 471 words, most of which were repeating low value words or from low value segments of the page (eg the "Link to this slang definition" section). Google knows the page is thin on quality content, aka shallow. I found this to hold true across all the pages I checked.
On most of the pages I checked the most frequent non-common words were "definitions" and "include" - ie hyper low value content. Your typical page looks like it might have 50 or fewer words of quality non-repeating content (content unique to just that page, written by a human). That's hundreds of words shy of what it needs to be at an absolute minimum.
The pages all have repeating content farm structures (which were increasingly common 8-12 years ago) with shallow content in those areas of the page, which is something Google sought to clean up around the time you're claiming the penalty occurred, which makes perfect sense.
You built a content farm, with a lot of shallow content pages, Google dramatically changed how it treats those types of sites, and your site most likely got nuked because of it. If you build a content farm structured site, where all the pages have repeating segments and your offset of positive value only consists of dozens of words of quality text on each page, you stand zero chance of getting traffic. That started being true 8-10 years ago as Google updated its algorithms to target content farms, to wipe them out of existence. They killed a billion dollar site, eHow, circa 2010-2011 in the process:
Thanks for your feedback. It’s quite similar to other feedback I’ve gotten, from non-employees on Google’s webmaster forums.
My site is the only site on the web with legitimate definitions of slang terms and citations of their use and other data. My site offensivest.com is the only large-scale analysis of the vulgarity of text works - using data from the site. Until a couple years ago, it was the only slang thesaurus available without payment on the web. I invite you to check out similar sites to calibrate your sense of whether it’s low or high quality. For example, you could compare my site’s content, ads, and ranking to internetslang.com. https://internetslang.com/DBA-meaning-definition.asp Is representative of the site.
I don’t understand your saying that my site is a content farm. Almost every definition has been hand-edited by me. The citations are all blacked out. In terms of calibration, https://urbanthesaurus.org/ is another good site to look at. It takes its definitions directly from Urban Dictionary.
But fundamentally it boils down to: a Google employee looked it up and told me that there’s a manual penalty against my site.
> It’s quite similar to other feedback I’ve gotten, from volunteers on Google’s webmaster forums.
Then why haven't you corrected the thin content problem? Your site is drowning in content farm structures and thin content.
> My site is the only site on the web with legitimate definitions of slang terms and citations of their use and other data.
What you're saying has nothing to do with the problem. Your pages do not have enough original content on them, they're suffering from an extreme shallow content penalty. If only 5-10% of the words on your pages are original, quality content and the rest is content farm structured words, you are going to get massively penalized for that.
> Almost every definition has been hand-edited by me. The citations are all blacked out. I don’t understand your saying that my site is a content farm.
I'm saying that Google has judged it to be a content farm based on how they rank content and how they look for mass repeating structures on a site's pages vs how much high quality original content is on each page. This isn't complex SEO, this is very, very simple to diagnose. I wasn't joking when I said it took all of 30 seconds to figure out the problem with your site.
If all you have on a given page is 30 to 50 words of original text (text you wrote by hand so to speak), you will not rank, you will not get much traffic. That's SEO 101. Most of your topic page layout consists of repeating structures, such as: the header, the footer, the map, share / tweet / email section, related words, how common is this slang, how vulgar is this slang, link to this slang. That's all of zero or negative value.
One more time: ~95% of your page structure is of zero value for SEO purposes, and the only value on the pages is a few dozen words of original text. It's not enough; it's not even remotely close to being enough. That needs to be hundreds of words ideally, at a minimum.
> But fundamentally it boils down to: a Google employee looked it up and told me that there’s a manual penalty against my site.
That's very clearly not the problem with your site. It's the extremely shallow/thin content stacked against the repeating content farm structures on every page.
> And what about those conjured celebrities I added as a precaution? All five were all scraped right into Google’s search result pages. It provided undeniable proof that after being turned down, Google simply went ahead and stole the entire database of content CNW took eight years and over a million dollars to build.
I remember a few years ago Google had a big blog post about how they’d injected some fake search results to catch Bing scraping like this. It certainly seemed like they thought it was a bad thing back then.
I worked at Microsoft from 2010-2012, and the crux of the issue there was that, for Microsoft users who had the Bing toolbar installed, the toolbar would track what you were visiting in your browser and use that session/journey information to try and improve the relevancy of Bing results. If a lot of people who searched for "best pancake house" (regardless of what search engine you used) end up visiting ihop.com within the same browser session, well then Bing would want to rank ihop.com higher for that search phrase.
Google noticed this, and for unique/low-traffic search terms, was able to synthetically generate enough "fake" traffic that the "fake" traffic became the dominating signal for those terms, and therefore Bing started directing users to the fake results.
The bad thing here IMO is the level of tracking of users via this toolbar, but fundamentally this seems as bad as any other digital fingerprinting or advertiser tracking as anything else that's become common on the web. This is not to excuse Microsoft for doing a bad thing, but it really had very little to do with "scraping Google", which somehow became the popular media takeaway for this.
EDIT: a decent contemporaneous article in Wired: https://www.wired.com/2011/02/bing-copies-google/, mentioning how Microsoft was using the clickstream data from the browser/toolbar, not scraping Google results per se
That's the first I've heard of that side of the story, and I'm pretty appalled because everyone I know believes "Bing scraped Google" - which I think was how Google framed it.
I wonder how common misunderstandings like this can be prevented or treated.
Thanks for the correction - I never really heard any followup on this at the time and definitely missed this explanation. Obviously the tracking is bad but it's definitely a different thing to the way it was presented as them actually crawling a competitor.
Recently one of the lyrics sites had a similar complaint that might be the same thing happening in this case (and maybe it was the same issue for Bing's case too).
If you specify that you don't want Google to take snippets, but a third party ignores the robots.txt or meta tags and has a permissive one of its own Google can just scrape it from there.
A missing detail here is what did the textbox for the fake celebrities' net worth on Google link to?
I believe it was Genius that was using rare unicode characters in the lyrics that don't really show up when viewing the page. They were able to see that Google had those same characters in their lyric results and therefore Google was just scraping Genius.
Ah thanks, that's the case. There's definitely some wrongdoing there but it's murky who is responsible since Google actually got the lyrics from LyricFind who stole it from Genius.
And yet I constantly get CAPTCHA's just from blocking tracking cookies.
The biggest scraper of them all goes to incredible lengths to prevent scraping. How ironic.
I hope the US govt demolishes these monopolies. It's not just in web either, the closest historical precedent I can think of is the Robber Barons of the 30's
No, this is true in general. Maybe there’s been some bugs (I would be surprised, can you link these multiple examples?) but Google respects robot.txt files
Even assuming Google would not ignore that for a popular enough site (they would know!), there is no robot.txt setting that says "index my site, but please don't steal my content, thanks."
Yeah then they won't index you at all and you lose all your traffic instantly. How about a setting to index my site without scaping my content and using it all over the place without attribution
I searched for a couple of celebrities, and in one case celebritynetworth.com is the top result, in the other it is the third result.
Both queries show info boxes with the estimated net worth, both cite other pages (no link to celebritynetworth.com), and one showed another number than celebritynetworth.com.
celebritynetworth.com has a robots.txt. This file has no restriction on Google or any other bot.
Each celebrity page seems to have the following meta tag:
Given the above, it seems like Google is correctly listing celebritynetworth.com high in it's rankings, and using other resources for the info box, which is what celebritynetworth.com is asking them to do.
While it's hard to speculate about what happened in the past, it looks like Google is doing the correct thing now, and that this information isn't actually so novel in general.
"The Featured Snippet also incorporated images of the celebrity scraped from the web to create a widget that took up 40% of a desktop result page and 80% of a mobile result page. This is still how Google displays most net worth results today"
FWIW I tried "larry david net worth" and "craig david net worth" on Google from the UK and didn't get this result - Larry David was a snippet from Wikipedia, clearly attributed, and Craig David was from CNW, again clearly attributed - basically an oversize search result with the relevant bit highlighted. Not sure if this is a recent change in response to this or not
When you consider the impact/cost of an anti-trust case vs. the ease for Google to change rankings so that the evidence site (CNW) is back in the top 3 results...
Of course you would do that if you were Google... And here there's nothing illegal, it would just defuse more journalistic bombs for Google. Smart move.
I only see one solution: Make Google split the search business out so that it only provides search and nothing else. That was Google's stated 'point' of their own site and they should be held to it.
They are like a bad rash, just ripping off from businesses of all sizes and relying on their dominance in search and financial muscle to get away with it.
Siphoning off search into its own unit would defang Google's ability to use their search engine dominance to upend others business models and would also reduce the incentives to do such shady things as well.
Another solution is to make Google pay for the estimated damages caused to the website (I think it's reasonable to extrapolate growth for a few more years) as well as a massive punitive damage as well because of the callousness and deliberate manner in which they contacted the site owners AND TOLD HIM THEY WOULD TAKE WHAT THEY WANT. I hope Google is forced to pay CNW no less than one billion dollars for this obvious abuse of their position and theft of their content.
I think this quote is the most interesting in the testimony:
In June 2019, search engine analyst Rand Fishkin put together a report about Google using data from web analytics firm Jumpshot. The data show that today an estimated 48.96% of all Google searches end with the searcher NOT clicking through to a website. The same report estimates that 7% of all search clicks go to a paid ad result and 12% go to properties owned by Google’s parent company Alphabet. Moreover, those stats do not even show the full extent of the problem because the data largely relied upon desktop devices and could not track searches that took users to a Google-owned app like the YouTube or Google Maps.
I'm not sure if this is ethical and legal problem. But one thing for sure is Google has too much leverage both technically and financially.
Google has slowly increased the reliability on Google for consumers everywhere that they are essentially a monopoly which is bad for the market. It is about time a new player comes into picture with same user interface and different business model which can empower these internet entrepreneurs.
The same could be said for Microsoft, Amazon, Apple, etc. At least Google open sources their operating system, browser and are mostly responsible for Kubernetes. If people are going to have such high expectations of one company why do they not equally apply the judgement.
Because those companies are ignored while everyone jumps aboard the Google hate train, and if anything is going to be done to fix it then those changes need to be applied equally to all companies at the same time. Otherwise it does more harm than good.
Doesn't this behavior eventually hurt Google? If there's no incentive for third parties to collate, verify, and organize facts...eventually there isn't anyone to scrape. And the facts go stale.
Whether it will hurt them before the heat-death of the universe is an open question.
On an internet populated entirely by 100 people, it wouldn't take long for shady practices to hurt shady companies. On the internet we have, in which there seems to be an unending supply of people falling over themselves in their eagerness to see their content--any content--displayed on Google's pages for free, the "eventually" in your sentence is doing quite a lot of work.
Which (for worse) has been the reason to roll back all the antitrust laws that would have stopped Google from ever becoming powerful enough to do this in the first place.
Google of course argues "faster results" are good for the consumer, but if you strip away the incentives for publishing information in the first place, you quickly end up with "bad results" then "no results."
It would depend on how the websites google are collecting data from operate. For example, somewhere like wikipedia isn't harmed through google's re-use of the facts they collate.
Wikipedia is directly harmed, in the most definite way, by not being able to collect donations from the people that rely on the information it collects.
An odd witness for the very true issues people have with Google Search. Google is crediting all websites it uses information from. It might be sad for such a site but I wonder how the creator thought that this is sustainable. Sites with biographical data for famous people already existed and for them adding net worth information is easy. So if not Google had kill them, they might have being killed by IMDb oder Wikipedia.
According to the document, the net worth was unique researched data that is still yet to be accurately replicated. In other words, the company created the data, they did not simply display it. If they were sunk by IMDb copying their data then the data source goes away too, unless IMDb invests in a research team.
What happens when CNW goes offline, how will Google replace the now missing data? Remembering that celebrities constantly change net worth and new ones show up every week, so it's a moving target that someone needs to chase.
Crediting the source while providing the information has the same benefit for the user accessing that information, but not the same benefit for website presenting that information. The website loses on whatever benefit a visiting user brings in e.g. advertising revenue, lead-in to a subscription / patronage, a plea for a donation etc.
I still remain very much against ads, but I have a more sympathetic take to financing journalism as a result of reading this account. Perhaps just having the headline be free, but having the rest be behind a paywall is a better middle ground?
Surely they can find a stronger example than a spammy SEO business that didn't exist before google, couldn't exist without and doesn't exist after? Google sucks now but not finding this compelling
Information without DRM can easily become a public good: it can be copied without limits and a loss in quality. Once it's out in the open, people will access it easier, but that also makes it near-impossible for the information-owner to make a living. At the beginning, Google was beneficial to both the information-seekers and the information-owners because it connected them (while showing ads in the search results). With the "featured snippet" it increased the user experience by a little (and multiply this with the user count), while depriving the original owner of information from its entire revenue. (Much like how Spotify made music cheaper for everyone, at the expense of artists.) The nice way to solve this would be if Google gave a cut to CelebrityNetWorth, as Spotify gives to artists. (While Google would bring CNW more requests this way, their price would also be depressed; Spotify doesn't pave the way to the riches. [1])
I have Zero respect for the Google engineers and PMs who worked on this project, scraping content without attributing to the original source, and subsequent change in the search ranking of the original source.
I am not trying to defend Google or its practices.
But how is the activity of CNW researchers different from what is Google doing? Unless the author elaborates on their method, "Educated guess" sounds like scrapping content without sourcing the author in the first place.
AFAIK there's no other database that they scrapped from. Reading sources like news articles, documents etc. is very different from copying the database that someone else created.
It's like the difference of tapping Yelp's API and create a competing service and looking at Yelp, Google, Squirespace etc. to create a restorant recommendations website.
Humans can also scrap. If CNW employs researchers, they must be using sources. These sources are not referenced in their content. Either CNW is merely making up the net worth figures, or their researchers are pulling data from somewhere to provide information about 25,000+ celebrities. If the second is valid, they aren't that different from what Google is doing.
I answered your question about how CNW researchers are different. Using sources to create a knowledge base for something is very different than hooking up to a knowledge base that was already curated and displaying it as your own without compensating the people who curated it.
Yes, sure. It would be nice of them to disclose their own sources too but I fail to see how this is related to Google straight out copying their curated database and showing it to its own users.
Like, why are we discussing it? What is the agenda here? Are we trying to conclude that CNW owners are also not that nice? Are you claiming that CNW copied your own research and demand justice too? If you explain what is your point here, maybe we can have a productive discussion instead of repeating the same stuff.
Understable not to include the sources after that email exchange because if they source the articles on their site Google could just cite those articles. What isnt clear to me is where the fake ones were attributed to in the info box. For the real celebrities it seems like they're mostly from Wikipedia which cites articles (who maybe look it up from CNW?)
Indeed. When you're in the business of taking IP from small companies and calling it yours, canaries start to choke. It's like making a pink version of anything: you're done, you're out of ideas, you have ceased to innovate.
where Google 'disappeared' an entire category of apps from the App Store because one of our competitor/friend companies in the same area as us was growing very fast
That guy started talking about adding thousands of new users a day. Within 2 weeks Google had removed ALL (every single one) apps in that category
Also this is a search vertical. So it's the very definition of anti trust
*
anyways, it's good that people are talking about this
It looks more like an ethical argument than a legal one. I don't understand how this data would be protected under US law. Can someone explain if you think otherwise?
Is the issue that the data was scraped? Or is the issue Google appropriated the scraped data for use in their search results?
From my perspective it just looks like a bad business model: spend a lot of time and effort estimating net worth, and then publish it on the open web. Publicly available data can be freely and legally scraped (and I wouldn't want it any other way): https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...
> It looks more like an ethical argument than a legal one. I don't understand how this data would be protected under US law. Can someone explain if you think otherwise?
That question kind of misses the point. This was a US House hearing. The right question, in the context of such a hearing, is whether it should be legal. The other thing to keep in mind is that this was the action of a company with considerable market power. There's not much that's off limits in principle once it turns into an antitrust case.
Having said that:
> it just looks like a bad business model
seems about right to me. No other companies had done this, and that might be because of the difficulty of monetizing numbers. Few people are going to pay for the information, but it's ultimately just a number. Consumers are better off the faster they get the number they're looking for.
This has been settled case law for decades thanks to sports almanacs. Facts (like someones net worth) cannot be copyrighted. However a compilation of facts can be.
If CNW had published a book, Google purchased it off of Amazon, and paid an intern to retype it in to fact boxes - that is clear copyright violation.
The fact that Google bought the data from a third party data broker who scraped CNW does not absolve them of liability in my mind.
I'm not sure the data in question here would qualify as "facts" like a batting average or win percentage would. These are researched estimates. Different outlets would come up with different numbers depending on the researchers and techniques they employ. Ergo it requires a degree of skill and creativity.
But as you say, the point is moot. The collection itself should certainly be covered by copyright whether its contents are a mere collection of "facts" or not.
> Facts (like someones net worth) cannot be copyrighted. However a compilation of facts can be.
It’s my understanding that this ceased to be true in the USA in 1991 and that a minimal degree of creativity is required [0]. Other countries have started following suit more recently.
What you are saying about compilations makes sense, but was it a compilation in that sense? My interpretation was it was just a bunch of data spread over public pages on a website. It seems like if that constitutes a compilation, it would expand copyright by a lot
The case you referenced says scraping is legal. It does not say re-publishing the scraped data is legal.
>I don't understand how this data would be protected under US law. Can someone explain if you think otherwise?
For something to be copyright-able it needs to have an element of creativity. That's why you can't copyright facts. Arguably, the net worth estimates require a degree of creativity to come up with. They aren't "facts", they are estimations based on research.
> We generated content that could not be found anywhere else on the web and that had never existed previously in response to demand in the marketplace. The information we produce to this day is unique. It is not commodity information like the height of a famous building or the time of day.
To call it the open web makes it sound like the wild west, and it's not: are Google SERPs on the open web? Can I copy the search results for 1000 words and paste it on my website? No.
It's the business model of many online blogs and sites, and get payment through ads. Of course it's not a really strong model now that ads revenue are going down. It was a strong one not so long ago.
Now it's clearly IPR theft, present it how you like, but theft is theft.
Plus it's not really smart from Google given their prior contact and request to this company on APIs and providing it for free.
Now the real question: Is Google going to go away with it blaming it on "rogue employee" and continuing on dispensing huge sum of cash and free advisory to gov and internet NGOs in their constant advocacy effort...
1. Google doesn't show in-line'd data box with Net worth
2. Google asks site (CNW) for access, is denied.
3. Google then shows site's data in data box (with CNW site proof of theft with canary tokens being the fake celebrity's net worth) that are now presented as Google's data (no source or fake source), without license from original site.
If that isn't theft of data, what is it?
From the PDFs, there is no license for free re-use ;-)
While I'm all-in for freedom of information and open source documentation license or open data licenses, I fully understand the need of businesses to protect their intellectual property and rights.
You keep calling it “theft”. It is obviously not literally theft, since no physical object has been taken. The only thing you could possibly mean is “copyright infringement” (which is commonly called theft in imprecise and inflammatory language), but copyright does not apply to a simple fact like a dollar amount, no matter how much research lies behind it.
You might be of the opinion that it should, but that is a different argument, which you did not make.
Ok, sorry sir ;-) Copyright infringement. What's the difference really here?
IANAL means this precisely (and on a side note, I'm here for open exchange and debate, not for claiming to be right ;-)
Interestingly, depending on jurisdictions, there are places on the globe (EU, FR) where I think the terminology is theft not copyright infringement ("theft of database"), even if it does not apply to this US National dispute. Now let's see how this goes with the US gov.
> Copyright infringement. What's the difference really here?
Upthread, the user “errantmind” asked you to provide your legal basis for it being “theft”. It’s obviously not legally speaking theft, since no physical object has been taken. But let’s be a bit charitable and assume that you meant “copyright infringement” instead. However, you have not made the case for it being copyright infringement either, since Google’s copying (which you do describe in detail) did not, AFAIK, infringe copyright (at least in the US, which is where the investigation referred to in the document is being held).
If you want to argue that what Google did is illegal, then please specify what actual law was broken.
On the other hand, if you instead want to argue that what Google did should be illegal, then please, go ahead and present your case, but this was not the original topic.
There is a significant difference, and an entire activism movement in ensuring this is recognized. Thanks for calling out (and I hope realizing) the difference.
All facts are well researched estimates. Our number for the speed of light didn't come out of the ether: someone came up with a very sophisticated experiment to get a highly accurate estimate and different experiments get different estimates. Do you think the speed of light should be copyrightable?
The speed of light is absolutely not an estimate. It's a mathematically provable fact.
Facts (in this context) are numbers that are not reasonably disputable. The value of pi, the speed of light and my current phone number are all facts.
What I happen to think the net worth of someone else is, is certainly not a fact. You might think it's something else. It would be difficult to prove either one of us right.
If you don't believe that the speed of light is a physically measured quantity, then replace that question with whatever physical constant you prefer (charge of an electron, gravitational constant, current mass of the earth, etc, etc). Do you think those numbers should be copyrightable? The charge of the electron is particularly interesting as there was actually a lot of debate over that number because some of the original experiments were flawed.
Your distinction between "measure" and "estimate" seems meaningless to me, and probably to most scientists and engineers. All physical measurements have uncertainty, and many people are engaged as we speak in creative (in the colloquial sense of the word) work seeking to reduce that uncertainty. We tend to say "measurement" when the uncertainty is negligibly small for whatever purpose is at hand, and "estimate" when it's not; but that distinction is as much in how the number will be used as in the number itself.
As to specific examples, the speed of light and charge on an electron are tricky because they're both exact by definition in SI today; but considerable creativity was required to get those definitions. The mass of the Earth is uncertain in kg today, and creative work is ongoing to decrease that uncertainty. So does that make a new estimate of the mass of the Earth copyrightable? Perhaps just the last, previously-unknown digits would be copyrightable?
I'm not aware of case law on anything too closely resembling physical constants; but I'm near-certain the answer if litigated would be no, and I'd guess the argument would be something like "you're uncovering some preexisting truth, not creating something new (no matter how 'creative' your approach to the uncovering)". Returning to the original question, the CNW guy presumably got competent legal advice and concluded the copyright angle wasn't worth pursuing, and instead focused on the antitrust angle in his testimony.
>So does that make a new estimate of the mass of the Earth copyrightable?
Likely not on its own (due to fair use). But in a "collection" of estimates for the mass of all known celestial bodies then yes, it absolutely would be copyrightable. The same way a collection of stock price targets produced by a research company are copyrightable.
I agree with the rest of your comment in general but it's relevance to the discussion is questionable. In court, the speed of light would certainly be "a fact" unless it was expressed to a precision beyond the margin of error, and that margin of error was relevant to the copyright claim. That's getting way off track from the original point since we'd be talking about a singular number instead of a collection of numbers.
Back to the original discussion: A collection of facts is not copyright-able. The poster said the collection of wealth estimates was factual and therefore not copyright-able. That's provably false since another, equally skilled researcher would come up with different numbers and they can't both be factual. Right?
>the CNW guy presumably got competent legal advice and concluded the copyright angle wasn't worth pursuing
You're trying to imply that this means he had no legal basis for his claim. I find that very disingenuous. Besides, there's no need to speculate on why he didn't pursue legal action, the reasons are clearly outlined in the write up.
Edit: I see now why the "speed of light" example is not a good one since the meter is derived from the speed of light. Perhaps pi is the best example of a simple, agreed upon, physical constant or "fact" for the purpose of this debate.
Is there case law on the price targets? That one seems yet hazier to me, given the inherent uncertainty of predictions vs. (still uncertain) estimates of current truths. I guess in a deterministic universe they're the same thing, but that seems a bit too philosophical for most lawyers' taste.
In any case, large companies with near-infinite legal budgets still don't get to (openly) violate well-settled law, since judges do see what's happening and seek to minimize the burden on their opponents--and even award legal fees where possible, which it probably would be here under the Copyright Act. Their advantage comes when the law is at least slightly unclear and a victory would be worth a lot more dollars to the large company (for the precedent) than to their opponent. I believe that's the situation here, and the reason why CNW chose not to pursue Google for copyright infringement; though even with matched legal teams, I'd still bet that CNW would lose.
Finally, if that mass of the Earth isn't copyrightable, then no concept of fair use exists, because that concept exists only for copyrighted material; so I don't think your statement there makes sense. If a copyright for the CNW dataset did exist (which I doubt, but maybe), then Google could still argue fair use. I'd bet Google would lose that one, since they're using the whole work, without transformation of the original, in a way that destroys the commercial value of the original.
“Researched estimates” that in CNW’s case are virtually fictitious. To me, the scandal is Google enabling and boosting this SEO spam in the first place.
actually this is interesting right here, because if Google takes CNW's fake celebrity, then that is definite copyright infringement - the fake celebrity was created by CNW and as such is not a fact so Google can't argue that it's not copyrighted because facts are not copyrighted.
You are correct, about Trap Streets. I'm not sure if the language is fake facts hidden in real facts, or fake streets in maps. Because it seems pretty specific to trap steets.
If the language is indeed not general beyond trap streets - if I create a fake celebrity and data regarding them, I would argue that is more like making a story. A trap street is a small line on the map (generally hidden somewhere unimportant) and a couple words. A fake celebrity requires much more content.
From the PDF "I added fivecompletely conjured celebrities to the site. I used stock photos with entirely made up names, biographies and net worth numbers. I published the pages backdated several yearsso they were nearly invisible to the world."
It seems reasonable to assume from the pdf that they put the kind of effort into an article on the fake celebrities to make them indistinguishable from real celebrities, it may be decided that this is the same as a trap street. But I don't think it is just a foregone conclusion.
You would not expect of such big business to behave in a different way. What we consumers can do is to try to use alternative services where we can, e.g. I am happy with using Firefox instead of Chrome
Many commenters are interpreting this as strictly a matter of copyright violation, and are asking the question of whether facts can be copyrighted. But that's too narrow a way of looking at this; there's also antitrust law. If Google scrapes everything interesting off of other sites and presents it as their own, even if it finds a way to do this that does not violate copyright, it's anti-competitive behavior that may violate antitrust law.
(I'm not addressing the European database protection rules which may also apply).
> Earlier this year, within weeks of the publication of a Wired magazine article that included quotes from me and a recap of our story, the CNW mobile app was banned from the Google Play store without explanation or recourse
A lot of talk here about the scrapping and capturing of the ad revenue, but I thought this was the most scary part. It’s one thing to capture ad revenue from SEO companies, it’s another thing to act like the mob and start taking retribution against anyone that speaks out against you.
I think thats the point of a monopoly. They have such a large overarching reach that they don't need to respond in a direct way that could be proved to be retribution. They respond in another avenue that affects your business in another platform they control, in ways they don't have to explain.
I think the points about people voluntarily building their businesses on Google's platform, and being thus beholden to Google's whims are well-made. The law, as it exists, doesn't seem to prevent that kind of thing since they're all private businesses, and this behavior benefits the consumer (arguably) even if it does hurt other businesses. Antitrust law (in the US) seems ill-prepared for the 21st century.
I tend to get interested in the policy concerns. Is web search so important that it should be considered a public utility, and thus regulated by the government? I think there's a case to be made there.
Oddly enough, as their practices may destroy the internet perhaps they will eventually destroy their own livelihood.
Speaking for myself, I no longer search for product reviews and comparisons. Why waste my time when all the content is fake? Probably fake due to the Google search ranking tweaks. Now I go straight to the sellers.
What is sad in all this is all the great people that work for such shitty companies and what that implies regarding the state of our morality as computer scientists and engineers.
The impression I got from reading comments here is that CelebrityNetWorth.com isn't much more than a compilation of net worth figures in a database. I don't think this is a fair analysis of the holistic value of that website.
In terms of how much of the intrinsic value of CelebrityNetWorth's website Google should be allowed to report directly in its search results, I'd say it would be fair to report CelebrityNetWorth's estimate of a celebrity's net worth so long as Google clearly attributed the data to the site that is their basis, and provided a link to the original information.
But in my opinion, the real value of CelebrityNetWorth.com in 2020 is the curated biographical content about the celebrities and their analytical pieces presented in context with their net worth estimates.
People are saying this was a bad business model, but looking at the numbers it clearly wasn’t—-unless we’re also acknowledging that the bad part of the model was relying on Google to stay true to its stated policies.
NAL, but IIRC it was alleged in the lawsuit that LinkedIn was making a competing product to hiQ. It also seems related to Associated Press vs Meltwater US Holdings, where the AP has developed a network to investigate and promulgate cutting news, and another company was copying parts of the news under "fair use", which was deemed infringement [1].
Android, Chromium and Kubernetes have come a long way. Plus there are a lot of other major open source projects that come out of Google. Way more than Microsoft, Apple and Amazon.
Yes if you compare the code base size and the core components. Microsoft still hasn't open sourced it's main consumer operating system and neither has Apple.
Yes, that's OK if it's symmetrical: why can't we easily automate Google search results?
It's asymetry here that is the problem, not Information Freedom.
And i'm pretty sure this is the line/dark pattern Google is hoping for: that people give credit to Google for "Information Freedom"... missing the real issue.
Obviously it is not just about information, but anything they can scrape bucks out of.
If it were a service they could copy, they would. Meaning probably your job as well.
So fine, they copy it and you are fine with it, you still offer a premium and theirs is a cheap copy and you have a business.
But in addition, to maximize profits, they also somehow prevent you selling your service. Like they did here by hiding visibility.
Meaning that they don't just offer same kind of thing, but prevent anyone buying anything but their own product.
And you will keep thinking this will happen to someone else, until it happens to you. It just starts from artists, writers, etc. whose work happens to be easy to copy with the current technology.
I think what author said is explained in the article, as that such approach undermines small businesses which provide informational services. They won't be able to hire data analysts and researchers anymore, hence quality of information in info boxes will decline as information gets outdated. In other words Google is stealing data presenting as it's own taking all the benefits of embedding ads data
The consumer is harmed because there is no one incentivised to create new content, or at least the incentives are severely diminished.
In the story here the celebrity net worth data had to be created and this was worth doing due to the ad revenue. Making the existing data available to consumer is a benefit but the negative is the lack of new content. So now consumer have easier access to old, inaccurate data.
I think Google is in the right here. Facts are not protected by copyright and they shouldn't be. Imagine being able to copyright the density of steel or the list of historical US presidents. It would be a disaster.
Some things are facts, like the density of steel or the list of historical US presidents. But other things aren't facts, and where something isn't a fact but instead is implicit knowledge derived from a set of both data and assumptions (like a model or an algorithm) Google should absolutely not be taking that data and treating it as their own. Given the PDF includes this quote "I decided to build a company that attempted to make educated guesses based on the sourcing of a full-time research team." I think we can assume that the data on CelebrityNetWorth wasn't "facts".
However, even if CelebrityNetWorth's data were facts, I don't believe Google should be able allowed to replicate the work that people have put in to building a website that lists facts. Google could, and eventually probably will, replace every fact based website with a sidebar in their results page. People can't compete with that. That's using Google's (de facto) search results monopoly to stop people searching the web. That's making the web worse. That's stifling innovation. I might be able to dream up some way of using CelebrityNetWorth data to improve the world (somehow...), and Google would be stopping that.
The engineers who work at Google need to stop ruining the web by trying to turn it in to their own walled garden.
This is interesting. The final outcome could be that CNW shuts down. Then, the data won’t be updated. So then google will have to drop the info box. And then someone will come along with an idea to track the net worth of celebrities...
Has Google set up some kind of hysteresis that results in the long term, repeated building and subsequent crushing of data aggregation companies?
Have you read the written statement? He says that it's not commodity information like the height of a building or the the current time.
Just like Forbes, they employed a team of fact-checkers to find and generate estimates. Knowing the exact net worth of a celebrity isn't just a fact that someone can determine and so for Google to steal that information, which is the result of research and scrutiny is wrong.
While an individual fact can't be copyrighted, a collection of them can be in certain circumstances when the compiler has put in effort to create a database of them.
The scariest thing for Google might be that generally people aren't cheering for them anymore, and many are following them closely to catch them as soon as possible.
Compare that to 10 years ago when I and others would cut them 9 yards of slack.
I am not sure why copyright law doesn't apply? Also celebrity net worth should create a trademark net worth which is obviously an approximation of the true net worth. That's different than a sports fact.. But still I don't see why copyright law doesn't apply if Google is directly sourcing material.
Anti-trust q: how is celebrity net worth "your" business if ~100% of your users came from Google in the first place?
Seems to me, you're entitled to lock up the data behind a paywall, and Google is entitled to not send you its users. Google's users can switch search engines.m and some do, e.g. I use duckduckgo for incognito.
More to the point, how would regulating Google change any of this? Who would determine a fair price for every piece of information?
My sense: to the extent you don't care about the privacy impassioned of CNW, they're a legit value- added service. The exact economic value is somewhere between $0.01 per year and $100B. The free market can be cruel, but it's the best mechanism we have for price optimization.
Personally, I'd have pushed back on Google researchers and told them that we know our content is unique, and if they want accurate results, to consider entering into a legit partnership... for example, publish N year old data for free but require captcha/ToS (excluding search engines) for current data. Individual users may not care, but search engines will want the fresh stuff as a competitive matter.
Of course, this all assumes there isn't a second CNW competitor with good-enough data, offering it for less than CNW... and if less=free, then too bad CNW...
This to me is the weakest link in the CNW business: too easy to gather good enough data cheaply and there's not enough value beyond the basic number e.g. what's the breakdown of CelebrityXs NW, NW over time etc etc - very little market for this premium content.
Put another way, i could create a website called "what's the current time in Minsk?" but its economic value would be closer to $.01/yr because it's undifferentiated and there's no premium value, so search engines can and should just display the time.
Now let's take weather: in some places the weather is pretty easy (see LA Story, with Steve Martin)
and in others (cough nyc cough) the free feeds like NWS are crap and you never know whether to pack an umbrella. Accurate weather for NYC would be amazing value to millions of people and highly differentiated. The problem is that this data is super hard to get - even current data are often wrong! But if by some miracle you had this data, no way would i publish it - instead show that you've got the goods and license it. Google might play cheapskate but the weather apps would pay up and probably uber/lyft, the city of NY etc. Once you've got a few customers then approach Google: this data would be useful across the enterprise. You'd make a killing. But oh yeah, it's really freaking hard (unlike CNW).
I read the statement and think Google used and then and abused their service, but at the same time, publishing a database of people's net worth strikes me as being pretty unethical. Would they have been able to handle a GDPR request? Why didn't they put their own and their families info up there first?
I would say it is clearly not unethical, but it can still be a bad decision. I would say it is a disadvantage for everyone in low or medium income regions.
My immediate reaction would be, "yes, it is unethical for many reasons". But again, I heard the social welfare system in these countries is best in the world. So may be I shouldn't complain.
However, if I don't want to disclose my net worth and if law is on my side in my country, then Google making my data available is definitely wrong.
Ethical is a bit of a wide term but I’m sure it’s easy to construct situations that are problematic.
E.g. an abusive ex partner can check whether you’re likely to have switched job, even from the most basic info. If tax paid is revealed then you can work out what government benefits a person might be on, including those relating to health and disability.
Isn’t the implementation in Norway such that the press have special access not afforded to individuals? That also seems pretty questionable.
Everyone have access but you have to log in with your ID and the ones you check can see that you checked them. You can check 500 people each month.
Your scenario seems pretty far fetched to me since this information will be published in October for the previous year. It contains the income and tax for the whole year only. It would be pretty difficult to know the information you suggest from the limited, delayed records.
As a Norwegian I fully support our system with public tax records.
I’m not sure how delayed records would stop you working out disability status or any other sensitive status that affects tax paid. The tax regime for any given year is presumably public info.
People are notified but that’s hardly going to stop someone seeking to upset you as in the example scenario.
Though as I understand the press can do mass lookups with no notification. Don’t you find that disparity concerning? The idea that the government gets to decide what counts as ‘press’ is already something people from countries that don’t operate that way would consider objectionable. Let alone giving them the power to decide who should be 'outed' for their tax affairs.
To be fair, that's perhaps more an issue of this particular implementation, rather than fundamental. But it's an example of how it can be used to entrench the establishment.
Everyone have pretty much unlimited access in practice so I dont see that the press have any special advantage. The press have guidelines that they should only publish the tax of public figures and I have not seen them abuse this so far.
We are a country with a high degree of social trust, and that is what I believe makes our country great. I realize that our system is not for everyone and that is fair.
Sure, I've spent some time in Norway, I get it. It's definitely very different to societies like the UK or US.
I'm not fundamentally against open tax records like this, but even in Norway there are bad actors and if we are going to play a game of 'how can this be exploited', I think it's definitely possible to do so.
I can't really agree that a 500 limit with notification is 'in practice' the same as no limit with no notification. Data mining and other techniques are possible at scale but not with only 500 records. It also makes exploitation more profitable. Harvesting data for e.g. commercial targeting is not very appealing for n = 500. It's definitely appealing at n = 5 million.
Do the press abuse it in this way? Well, they don't publish stories about doing it. But how would you really know otherwise until some scandal breaks? Are lookups by press orgs also open to public inspection? Maybe you have some freedom of information laws that would let you request that data.
Sure, almost any system can be exploited so at some point we just have to compare the advantages with the disadvantages. I am not aware of any abuse of this system. Everyone knows that their tax and income is pretty much public knowledge so they will live their life by that assumption.
Arbitrary businesses cannot access it no. However the press can gain special access for a small fee. This means that when the lists are published there are always a lot of news articles about how much various public people earn. There are limitations on how they can use it though, and they cannot publish the whole lists.
In the transcribed hearing founder makes it clear that it's more about educated guess based on publicly available information. Net worth and not the owned assets which is completely different number (as celebrities usually have rights to their personal brands which are not transferable).
In general this in theory could be challenged, but mostly based on the name and photos and not about inferred information. This might get even worse since a lot of celebrities are using stage names which might be protected as a brand but go unprotected as a personal information.
I felt sad reading the whole thing. I searched "will smith net worth" and after wikipedia snippet, the first result was celebrity net worth.
CNW depends on google, uses them to make money. Google is a search tool and wants to show whatever results makes users happy. CNW can simply remove itself from the platform and tell google never to use their data. It almost sounds like CNW is trying to bite the hand of only company made them something. I feel bad for CNW but this is natural. I've seen many of my and my clients' sites go rock bottom because of google updates.
"Google Stole My Business" sounds harsh. More like "google doesn't do what I want them to do." CNW was nothing before google. Maybe CNW could make technical changes to their content the way wallpaper/graphics sites made changes after google image search updates.
No of course not. My point is that google will change things as they did in the past. The way google snippet works is simple, they scan the data and sum it up.
I searched multiple results right now. "bill gates net worth" data was taken from Wikipedia and CNW didn't even exist on the first page.
"Beyonce net worth" showed a detailed snippet for businessinsider, and second link was CNW. Same for Eminem, 210M from groovewallet, and second result was CNW.
I just couldn't find what CNW is complaining about.
As the source for Bill Gates' net worth. Why can a free, community edited website successfully attribute it's sources but a trillion dollar company cannot?
Can you or anyone show me an example? I searched over ten celebrities and only found CNW as either 2nd or 3rd result. All snippets were giving full details to the sites.
What looks far more inculpatory to me is this, on the last page:
[…] Because it controls essentially the entire internet, Google has endless levers at its disposal to significantly harm or snuff out a rival.
I don’t think it’s a coincidence that our organic rankings have continued to suffer since I’ve become a vocal critic of their practices. Earlier this year, within weeks of the publication of a Wired magazine article that included quotes from me and a recap of our story, the CNW mobile app was banned from the Google Play store without explanation or recourse. As a result we were also banned from using the Google’s dominant Ad Mob mobile ad platform.
This is far more damning of Google abusing its monopoly, not only on search, but also on the Android platform.