Most search engines nowadays have the advantage of being closed source (you don't know how their algorithms actually work). This makes the fight against unethical SEO practices easier.

With a distributed open search alternative the algorithm is more susceptible to exploits by malicious actors.

Having it manually curated is too much of a task for any organization. If you let user vote on the results... well, that can be exploited as well.

The information available on the internet is too big to make directories effective (as they were 20 years ago).

I still have hope this will get solved one day, but directories and open source distributed search engines are not the solution in my opinion unless there is a way to make them resistant to exploitation.



I've been thinking that the only way to get around the bad-actor (or paid agent) problem when dealing with online networks is to have some sort of distributed trust mechanism.

I feel like manually curated information is the way to go, you just have to find some way to filter out all the useless info and marketing/propaganda. You can't crowd source it because it opens up avenues for gaming the system.

The only solution I can think of is some sort of transitive trust metric that's used to filter what's presented to you. If something gets by that shouldn't have (bad info/poor quality), you update the weights in the trust network that led to that action so they are less likely to give you that in the future. I never got around to working through the math on this, however.
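
A minimal sketch of how that update rule could work (the names, the multiplicative penalty, and treating transitive trust as a product along the chain are all my guesses, not worked-out math):

```python
# Sketch: each user keeps per-peer trust weights, scores content by the
# chain of peers that recommended it, and penalizes that chain when bad
# content slips through.

def score(path_weights):
    """Transitive trust of an item: the product of the trust weights
    along the chain of peers that recommended it."""
    s = 1.0
    for w in path_weights:
        s *= w
    return s

def penalize(trust, path, factor=0.5):
    """After a bad recommendation, shrink every weight on the chain
    that led to it (a crude 'backpropagation' of blame)."""
    for peer in path:
        trust[peer] *= factor
    return trust

trust = {"alice": 0.9, "bob": 0.8}
# bob recommended something alice passed along; it turned out to be junk
penalize(trust, ["alice", "bob"])
```

In a real system you'd want the penalty to depend on how badly the recommendation missed, and you'd want positive updates too, but the shape of the mechanism is the same.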


Domain authority is a distributed 'trust' system?

But you want 'manually curated' but not 'crowd sourced', which suggests you want an individual or small group to find, record, and curate all pages (? or domains, or <articles>, or ...) across more than 60 billion pages of content??

There are something like 1000 FOSS CMSs - I wouldn't be surprised if there are a million domains with relevant info to sift through just for that small field.

There's no way you're curating _all_ that without crowd sourcing.

Of course you don't have to look at everything to curate, but how are you going to filter things ... use a search engine?


Is there a way to get a list of every domain in existence?


This https://www.whoisxmlapi.com/whois-database-download.php is a start, but new domains are registered every second, so you're going to need to update constantly.

I think it's not possible, as domains don't necessarily have to be registered - a server can serve a domain at a particular IP so long as the requesting client asks for a hostname the server responds to.

Obviously if it's registered domains then you can in theory just get the list from the registries' zone files. They probably sell the full list for a price.

I imagine you can harvest a list with sufficient resources.


Thank you! This is very helpful for me.


That's very workable. Any agent should have a private key with which it signs its pushes. The age of an agent and its feedback score determine its ranking. That still leaves gaming possible through the feedback, but heavy feedback like "this is malicious content" could be moderated (so that people can't just report stuff they don't like).
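
One way to read that ranking idea (the formula and field names here are invented for illustration; the comment doesn't specify them):

```python
import math

def rank(age_days, upvotes, downvotes):
    """Older agents and better feedback rank higher. Log on age so a
    ten-year-old key doesn't drown out everything else; Laplace
    smoothing (+1/+2) so new agents aren't pinned to 0 or 1."""
    return math.log1p(age_days) * (upvotes + 1) / (upvotes + downvotes + 2)

def file_report(report, moderation_queue):
    """Heavy reports ("this is malicious content") go to a moderation
    queue rather than straight into the score, so they can't be used
    to nuke content someone merely dislikes."""
    if report["kind"] == "malicious":
        moderation_queue.append(report)
        return None          # not counted until a moderator confirms
    return report            # ordinary feedback counts immediately
```

The smoothing choice matters: without it, a brand-new agent with one upvote would outrank an old agent with 99 upvotes and 1 downvote.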


The reason I mentioned that the trust metric should be transitive and distributed is so that it prevents gaming as much as possible. You wouldn't want to have a trusted central authority (for everyone) because that could always be corrupted or gamed if it's profitable enough. Rather every individual would have a set of trusted peers with different "trust" weights for each based on the individual's perception of their trustworthiness, that could be changed over time.

This trust (weighting) should be able to propagate as a (semi-)transitive property throughout the network to take advantage of your trusted peers' trusted peers. This trust weight propagation would need to converge, and when you are served content that has been labeled incorrectly ("high-value" or "trustworthy" or whatever metric, when you don't see it that way), then your trust weights (and perhaps your peers') would need to re-update in some sort of backpropagation.

The hard part is keeping track of the trust-network in a way that is O(n^c) and having the transitive calculations also be O(n^c) at most. I'm quite sure there are ways of doing this (at least with reasonably good results) but I haven't been able to think through them.
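
For what it's worth, EigenTrust-style power iteration is one known approach in this space. A toy version (my illustration, not the math the commenter had in mind): each iteration is O(n^2) for n peers, and it converges because it's repeated averaging under a stochastic matrix.

```python
# Propagate personal trust through peers' peers by repeated weighted
# averaging. local[i][j] = how much peer i trusts peer j directly,
# with each row summing to 1.

def propagate(local, rounds=20):
    """Returns converged transitive trust as seen from peer 0."""
    n = len(local)
    t = local[0][:]                      # start from my own direct trust
    for _ in range(rounds):
        # one O(n^2) averaging step: ask the peers I trust who they trust
        t = [sum(t[k] * local[k][j] for k in range(n)) for j in range(n)]
    return t

local = [
    [0.0, 0.7, 0.3],   # I trust peer 1 a lot, peer 2 somewhat
    [0.2, 0.0, 0.8],   # peer 1 mostly trusts peer 2
    [0.5, 0.5, 0.0],   # peer 2 trusts us both equally
]
t = propagate(local)
```

Note this toy version converges to a network-wide consensus, which is exactly the centralization the parent wants to avoid; keeping the result personal (e.g. mixing each step back toward your own direct weights) is where the open design work is.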


>But heavy feeback like "this is malicious content" could be moderated. //

You're just shifting your trust problem around. You need to handle 4chan-level manipulation (millions of users coordinating to manipulate polls), or Scientology-depth manipulation (getting thousands of people into US government jobs in order to get recognised as a religion). If it's "we'll catch it in moderation", then whoever wants to manipulate it just gets a moderator ...

"Super-moderation": will a dictatorship work here? I don't see how.

"Meta-moderation": you're back to bad actors manipulating things with pure numbers.


You can't get around the problem of manipulation if your trustworthiness metric for content is the same for all people, as it is on Reddit, Hacker News, or Amazon, for example. Having moderators just concentrates the issue into a smaller number of people, and you haven't solved the central problem--manipulation is profitable.

But think of how we solve this problem in our personal interactions with other people, and this should be a clue for how to solve it with computational help. We have a pretty good idea of which people are trustworthy (or capable, or dependable, or any other characteristic) in our daily lives, and based on our interactions with them we update these internal measures of trustworthiness. If we need to get information from someone we don't know, we form a judgement of their trustworthiness based off of input from people we trust--e.g. giving a reference. This is really just Bayesian inference at its core.
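
The "Bayesian inference at its core" point can be made concrete with the standard Beta-Bernoulli model (my choice of model, not the poster's): trust in a source is a Beta(a, b) distribution, and each good or bad interaction updates it.

```python
# Beta-Bernoulli trust: a counts good interactions (plus prior),
# b counts bad ones. The posterior mean is our working trust estimate.

def update(a, b, good):
    """One interaction: a good outcome bumps a, a bad one bumps b."""
    return (a + 1, b) if good else (a, b + 1)

def expected_trust(a, b):
    return a / (a + b)

a, b = 1, 1                 # uniform prior: we know nothing about them
for outcome in [True, True, True, False]:
    a, b = update(a, b, outcome)
# expected_trust(a, b) is now 4/6
```

A "reference" fits naturally here: instead of starting from the uniform prior, you start a stranger's (a, b) from some fraction of the referrer's posterior, discounted by how much you trust the referrer.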

We should be able to come up with a computational model for how this personal measure of trustworthiness works. It would act as a filter over content that we obtain. Throw a search engine on top of this, sure, but in the end you'd still need to get trustworthiness weights onto information if you want it to be manipulation-resistant. This labeling is what I mean by manual curation. You can't leave that up to the search engine or the aggregator because those can be gamed, like the examples you gave for aggregators and SEO for search engines have shown.


>We have a pretty good idea of which people are trustworthy (or capable, or dependable, or any other characteristic) in our daily lives //

We really don't. People get surprised all the time that someone had an affair, or cheated, or ripped someone off, or whatever. "But I trusted you" ...

It's actually relatively easy to fool people into trusting you, as many red team members will probably confirm.

Look at someone like Boris Johnson: people are trusting him to lead the country knowing that he's well known to betray people's trust, and that he even had a court case lodged against him based on his very blatant lying to the entire country. You can even watch the video of him being interviewed where the interviewer says (paraphrasing) "but we all know that's a half truth", and BoJo just pushes it and pushes it and refuses to accept that it's anything other than absolute truth.

>If we need to get information from someone we don't know, we form a judgement of their trustworthiness based off of input from people we trust--e.g. giving a reference. //

This is domain authority again - trust some domains manually, let it flow from there. If that domain trusts another domain then they link to it, trust flows to the other domain, and so on. Maintaining such trust for a long time adds to a particular domain's trust factor; linking to domains not trusted by others detracts from it.
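
That "trust flows from a manually trusted seed set through links" scheme is essentially personalized PageRank (close to what TrustRank does). A toy sketch with made-up domains:

```python
# Trust flows from hand-picked seed domains along outbound links.
# links[d] = list of domains d links to.

def trust_flow(links, seeds, damping=0.85, rounds=30):
    domains = list(links)
    base = {d: (1.0 if d in seeds else 0.0) for d in domains}
    t = dict(base)
    for _ in range(rounds):
        # trust is re-injected at the seeds, and each domain splits
        # a damped share of its trust among its outbound links
        nxt = {d: (1 - damping) * base[d] for d in domains}
        for d, outs in links.items():
            if outs:
                share = damping * t[d] / len(outs)
                for o in outs:
                    nxt[o] += share
        t = nxt
    return t

links = {
    "curated.example": ["blog.example"],
    "blog.example": ["curated.example", "spam.example"],
    "spam.example": [],
}
t = trust_flow(links, seeds={"curated.example"})
```

Here spam.example still collects some trust (blog.example links to it), which is exactly the gaming surface: buy one link from a trusted domain and trust flows to you.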


So how do _you_ make any sort of judgments based off of what people say? What information do you use to judge whether their statements are accurate? Or do you always start with the assumption that everything everyone says is suspect? What sort of information do you use to come to any sort of conclusion, and how do you determine the trustworthiness of that information?

>This is domain authority again - trust some domains manually, let it flow from there. If that domain trusts another domain then they link to it, trust flows to the other domain, and so on. Maintaining such trust for a long time adds to a particular domain's trust factor; linking to domains not trusted by others detracts from it.

This can be gamed if you're able to update the trustworthiness of a domain for other people, and that's why a trust metric needs to be mostly personal, and should update dynamically based on your changing trust valuations.


Pyrrhonism, you start on the assumption that no-one [else] even exists and go from there ... ;o)

Seriously, I'm not so sure -- I try to trust first and then update that status as more information becomes available; but that's more of a religious position.

I don't think it's necessarily instructive to look at my personal modes here. I guess my main point is that if you're going to say "well humans have cracked trust, we'll just model it on that" then I think you're shooting wide of the mark.


Any trust needs some kind of root. The big problem is that you need to prevent a billion real users from being "outvoted" in that Bayesian inference by a billion fake agents (augmented by thousands of paid 'influencers') saying that spam is ham and vice versa, and ensuring that they all have good reputation.


> Having it manually curated is too much of a task for any organization.

ODP/DMOZ worked quite well while it was around. I don't think it would work equally well nowadays as a centralized project, because bad actors are so much more common today than they were in the 1990s and early 2000s; and because the Internet is so astoundingly politicized these days that people will invariably try to shame you and "call you out" for even linking to stuff that they disagree with or object to in a political sense (and there was a lot of that stuff on ODP, obviously!). But federation could be used to get around both issues.



