Reddit blocking all major search engines, except Google

gedaliyah@lemmy.world · 2 months ago

x00z@lemmy.world · 2 months ago

Hi, I’m new here. Because of the bullshit with Reddit. Greetings fellow Lemmy people.

✺roguetrick✺@lemmy.world · 2 months ago

Welcome to our shithole.

5redie8@sh.itjust.works · edit-2 2 months ago

Federated shithole(s)

ulterno@lemmy.kde.social · edit-2 2 months ago

More akin to a rabbit-hole, due to that.
But who said rabbits don’t shit in their holes?

Oh! And the soil is transparent.

Anti Commercial-AI license CC BY-NC-SA 4.0

Excrubulent@slrpnk.net · edit-2 2 months ago

Welcome! Genuine advice for a newcomer: look around, figure out what instances you like, and shift away from lemmy.world to an instance that requires a sign-up request and which comports with your values. There is an account migration feature to make this as easy as possible.

It’s different to what people are used to, but in my experience a huge number of the worst people migrating from reddit went straight to one of the open instances. A lot of them were banned over there for quite legitimate reasons.

They know that they can’t operate their own asshole instances for long because they’ll get defederated, and they don’t want to deal with being known to an admin who has actual principles, so open sign up is their thing, and those instances are filling up with them.

Honestly I would like to see a feature that flags if a user’s instance has open sign up.

It’s getting to the point that if someone is still on an open instance, they’re a little sus to me. It’s easier to trust people who come from instances whose policies I agree with.

P00ptart@lemmy.world · 2 months ago

Bro… What?!? I’ve only been here a day and I have no clue what any of that means lol

kautau@lemmy.world · 2 months ago

Lemmy isn’t one service like Reddit. It’s a piece of software where anybody can run their own lemmy instance. Lemmy.world is the most popular, but there are many others. And those choosing to run an instance can “federate” with other instances, which means as a user you can see posts and comments from the other instance even though you are logged into the one you have an account on.

So the commenter is recommending you look at posts or comments from users on other instances that have more stringent sign up policies, and migrate your account there. Since your account is new, you likely don’t need to spend the effort on migrating your account and instead can just set up an account on another instance/server.

But it’s also fine to stay on lemmy.world. Just be respectful, voice your opinions like you would in person with other humans, and you’ll be fine. And if you’re just here for the memes, that’s ok too! Enjoy them! And welcome to lemmy.

P00ptart@lemmy.world · 2 months ago

Hey, thanks for the detailed explanation! That certainly helps, but it’ll probably take me a while to fully get it. I signed up using voyager and it didn’t tell me anything like that. I’m sure it’ll make more sense as I get used to it. So can I not see all posts from other instances?

kautau@lemmy.world · edit-2 2 months ago

Yeah, there are many instances, and many that have purposefully been defederated by lemmy.world. Often for good reason (CSAM, an abundance of spam accounts, violent or hateful rhetoric, etc). But generally lemmy.world and its federated instances are pretty great.

P00ptart@lemmy.world · 2 months ago

Ok, sounds like I’ll just stick with .world for a while until I get my “sea legs”

✺roguetrick✺@lemmy.world · edit-2 2 months ago

I think this guy is going to end up on dbzer0 once he gets his sea legs. Of note, the piracy communities over there will be some of the few things you can’t access from .world.

hahattpro@lemmy.world · 2 months ago

Let the two of them die together

tal@lemmy.today · 2 months ago

Blocking other search engines will hurt Reddit, all else held equal. But not by that much. Google is seriously dominant in the search engine market.

kagis

Yeah.

https://gs.statcounter.com/search-engine-market-share

According to this, Google has 91.06% of the search engine market. So for Reddit, they’re talking about cutting themselves off from a little under 9% of people searching out there. Which…I mean, it isn’t insignificant, but it isn’t likely gonna hurt them all that badly.

eronth@lemmy.world · 2 months ago

It’s also worth noting that the 9% they cut off was probably the group more inclined to already be using alternatives to Reddit anyways.

CleoTheWizard@lemmy.world · 2 months ago

You underestimate the amount of average joes that use stuff like DuckDuckGo

whatwhatwhatwhat@lemmy.world · 27 days ago

Seconding this. I work in IT, and the number of tech-illiterate people using DuckDuckGo as their default search engine is astounding. It’s got to be about 10% of our users (none of whom are in tech roles).

Dr. Moose@lemmy.world · edit-2 2 months ago

Reddit responded: “Only google pays us”. The content is not yours. You built this off a naive user base that just wanted to share, and now these fuckers are treating it as their entitlement. As an early Reddit user: fuck that place, I’m still angry.

Tja@programming.dev · 2 months ago

Legally speaking, the content is theirs.

Dr. Moose@lemmy.world · 2 months ago

No, I don’t think so. Just because you put a clause in ToS doesn’t make it legally binding and most precedent is in favor of the original copyright owner.

Jeffool @lemmy.world · 2 months ago

If someone posts a copyright violation on YouTube, YouTube can go free under the safe harbor provisions of the DMCA. (In the US.) YouTube just points a finger at the user and says “it’s their fault”, because the user owns (or claims to own) the content. YouTube is just hosting it.

I don’t know of any reason to think it’s not the same for written works. User posts them, Reddit hosts them, user still owns them. Like YouTube, the user gives the host a lot of license for that content, so that they can technically copy and transmit it. But ultimately the user owns it. I assume by the time Reddit made the AI deal they had probably put wording in the TOS to cover “selling a copy of the data” to whomever they want.

Now, determining if the TOS holds up in court is of course trickier. And did they even make us click our permission away again after they added it, or just change something we already clicked? I don’t recall.

urquell@lemm.ee · 2 months ago

FUCK u/spez

z3rOR0ne@lemmy.ml · 2 months ago

I’ve posted this elsewhere, but it bears repeating:

Just use ddg bangs if you use Duckduckgo and you can search reddit directly.

!reddit search term

or:

!r search term

It still picks up latest posts related to reddit, it just searches reddit directly instead of searching Bing’s results. It’s that simple.
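For anyone who wants to script this, here is a minimal Python sketch of what a bang search boils down to: a normal DuckDuckGo query whose text starts with the bang. The helper name `bang_search_url` is made up for illustration, and the redirect to Reddit's own search happens on DuckDuckGo's side, not in this code.

```python
from urllib.parse import urlencode

DDG_SEARCH_URL = "https://duckduckgo.com/"


def bang_search_url(bang: str, terms: str) -> str:
    """Build a DuckDuckGo URL whose query begins with a bang, e.g. "!r"."""
    query = f"!{bang} {terms}"
    # urlencode percent-encodes the "!" and turns spaces into "+".
    return f"{DDG_SEARCH_URL}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(bang_search_url("r", "search term"))
    print(bang_search_url("reddit", "search term"))
```

Opening the printed URL in a browser should behave the same as typing the bang into the search box.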

You can even use a redirect extension like Libredirect in conjunction with this Duckduckgo feature to redirect your search to a privacy respecting frontend like redlib.

Kyouki@lemmy.world · 2 months ago

DDG is awesome, been using it for years.

lennivelkant@discuss.tchncs.de · 2 months ago

I used to sneer at the kids in my class that used it. Must have been fairly shortly after it launched, something like fourteen to fifteen years ago. I’m still grappling with a certain inertia when it comes to switching away from something I have relied on for so long, but I’m coming around to the idea of giving DDG a try at least (irrational as it is, I’ve been reluctant to even try - I suspect out of fear of liking it and having to change).

Past Me would be exasperated that Present Me is even toying with the idea. But then, Past Me had a lot of stupid takes anyway.

unconfirmedsourcesDOTgov@lemmy.sdf.org · 2 months ago

I went through the same process that you’re describing. In the end, I gave it a shot and, anecdotally, I feel like I find the things I’m looking for faster than I was with Google and with no shoddy ai summaries.

noli@lemmy.zip · 2 months ago

I like to say that DDG gives you what you searched for while google gives you what it thinks you wanted.

Vanth@reddthat.com · edit-2 2 months ago

deleted by creator

tal@lemmy.today · edit-2 2 months ago

I wonder what kind of contract they went with.

https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/

SAN FRANCISCO, Feb 21 (Reuters) - Social media platform Reddit has struck a deal with Google (GOOGL.O) to make its content available for training the search engine giant’s artificial intelligence models, three people familiar with the matter said.

The contract with Alphabet-owned Google is worth about $60 million per year, according to one of the sources.

For perspective:

https://www.cbsnews.com/news/google-reddit-60-million-deal-ai-training/

In documents filed with the Securities and Exchange Commission, Reddit said it reported net income of $18.5 million — its first profit in two years — in the October-December quarter on revenue of $249.8 million.

So if you annualize that, Reddit’s seeing revenue of about $1 billion/year, and net income of about $74 million/year.

Given that Reddit granting exclusive indexing to Google happened at about the same time, I would assume that that AI-training deal included the exclusivity indexing agreement, but maybe it’s separate.

My gut feeling is that the exclusivity thing is probably worth more than $60 million/year, that Google’s probably getting a pretty good deal. Like, Google did not buy Reddit, and Google’s done some pretty big acquisitions, like YouTube, and that’d have been another way for Google to get exclusive access. So I’d think that this deal is probably better for Google than buying Reddit. Reddit’s market capitalization is $10 billion, so Google is maybe paying 0.6% of the value of Reddit per year to have exclusive training rights to their content and to be the only search engine indexing them; aside from Reddit users themselves running into content in subreddits, I’d guess that those two forms are probably the main way in which one might leverage the content there.
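The back-of-the-envelope numbers above can be checked with a few lines of Python. This is only a rough sketch using the figures quoted in this comment; multiplying one quarter by four assumes every quarter looks like October-December, which is a simplification.

```python
# All figures in millions of US dollars, taken from the quotes above.
QUARTERLY_REVENUE = 249.8
QUARTERLY_NET_INCOME = 18.5
DEAL_VALUE_PER_YEAR = 60.0
MARKET_CAP = 10_000.0  # $10 billion


def annualize(quarterly_amount: float) -> float:
    """Naive annualization: assume four identical quarters."""
    return quarterly_amount * 4


def share_of(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return part / whole * 100


if __name__ == "__main__":
    print(f"Annualized revenue:    ${annualize(QUARTERLY_REVENUE):,.1f}M")
    print(f"Annualized net income: ${annualize(QUARTERLY_NET_INCOME):,.1f}M")
    print(f"Deal vs. market cap:   {share_of(DEAL_VALUE_PER_YEAR, MARKET_CAP):.1f}% per year")
```

On these assumptions the deal is also large relative to annualized net income, which is part of why the pricing looks interesting.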

Plus, my impression is that the idea that a number of companies have – which may or may not be valid – is that this is the beginning of the move away from search engines. Like, the idea is that down the line, the typical person doesn’t use a search engine to find a webpage somewhere that’s a primary source to find material. Instead, they just query an AI. That compiles all the data that it can see and spits out an answer. Saves some human searcher time and reduces complexity, and maybe can solve some problems if AIs can ultimately do a better job of filtering out erroneous information than humans. We definitely aren’t there yet in 2024, but if that’s where things are going, I think that it might make a lot of strategic sense for Google. If Google can lock up major sources of training data, keep Microsoft out, then it’s gonna put Microsoft in a difficult spot if Microsoft is gunning for the same thing.

Vanth@reddthat.com · edit-2 2 months ago

deleted by creator

tal@lemmy.today · edit-2 2 months ago

If we do end up at a point without search engines, where AI does the search and summarizes an answer, what do you think their level of ability to tie back to source material will be?

I haven’t used the text-based search queries myself; I’ve used LLM software, but not for this, so I don’t know what the current situation is like. My understanding is that the current approach doesn’t really permit it. And there are two issues with that:

There isn’t a direct link between one source and what’s being generated; the model isn’t really structured so as to retain this.
Many different sources probably contribute to the answer.

All information contributes a little bit to the probability of the next word that the thing is spitting out. It’s not that the software rapidly looks through all pages out there and then finds a given single reputable source that it could then cite, the way a human might. That is, you aren’t searching an enormous database when the query comes in, but repeatedly making use of a prediction that the next word in the correct response is a given word, and that probability is derived from many different sources. Maybe tens of thousands of people have made posts on a given subject; the response isn’t just a quote from one, and the generated text may appear in none of them.
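A toy illustration of that point, in Python. This is not how an LLM is actually built (real models use neural networks over tokens, not word-pair counts); it is only a sketch of the idea that a next-word probability is pooled from several sources at once. The three "posts" and all function names are invented for the example.

```python
from collections import Counter

# Three made-up "posts" standing in for the training data.
SOURCES = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat slept on the rug",
]


def next_word_distribution(sources, prev):
    """Probability of each word that follows `prev`, pooled over all sources."""
    counts = Counter()
    for text in sources:
        words = text.split()
        for current, following in zip(words, words[1:]):
            if current == prev:
                counts[following] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def contributing_sources(sources, prev, word):
    """Indexes of the sources that contain the word pair (prev, word)."""
    hits = []
    for index, text in enumerate(sources):
        words = text.split()
        if (prev, word) in zip(words, words[1:]):
            hits.append(index)
    return hits


if __name__ == "__main__":
    print(next_word_distribution(SOURCES, "the"))
    print(contributing_sources(SOURCES, "the", "rug"))
```

Even in this tiny case, the probability of "rug" after "the" comes from two different posts, so there is no single source to point at.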

To maybe put that in terms of how a human might think, and to place you in the generative AI’s shoes, suppose I say to you “draw a house”. You draw a house with two windows, a flowerbed out front, whatever. I say “which house is that”? You can’t tell me, because you’re not trying to remember and present one house – you’re presenting me with a synthetic aggregate of many different houses; probably all houses have mentally contributed a bit to it. Maybe you could think of a given house that you’ve seen in the past that looks a fair bit like that house, but that’s not quite what I’m asking you to tell me. The answer is really “it doesn’t reflect a single house in the real world”, which isn’t really what you want to hear.

It might be possible to basically run a traditional search for a generated response to find an example of that text, if it amounts to a quote (which it may not!)

And if Google produces some kind of “reliability score” for a given piece of material and weights the material in the training set by that (which I will guess that if they don’t now, they will), they could maybe use the reliability score to try to rank various sources when doing that backwards search for relevant sources.

But there’s no guarantee that that will succeed, because they’re ultimately synthesizing the response, not just quoting it, and because it can come from many sources. There may potentially be no one source that says what Google is handing back.

It’s possible that there will be other methods than the present ones used for generating responses in the future, and those could have very different characteristics. Like, I would not be surprised, if this takes off, if the resulting system ten years down the road is considerably more complex than what is presently being done, even if to a user, the changes under the hood aren’t really directly visible.

There’s been some discussion about developing systems that do permit for this, and I believe that if you want to read up on it, the term used is “attributability”, but I have not been reading research on it.

Vanth@reddthat.com · edit-2 2 months ago

deleted by creator

leopold@lemmy.kde.social · 2 months ago

this is just going to cause indexers to ignore robots.txt

gedaliyah@lemmy.world · 2 months ago

“We always obey the robots.txt”

A bunch of corporations that have no accountability and plenty of incentive to just ignore it and have all been caught training AI on off-limits data.

Kairos@lemmy.today · 2 months ago

They’re likely blocking user agents too, which I think also doesn’t have legal enforcement (as in DuckDuckGo can just use “Google” unless they said otherwise).

Babalugats@lemmy.world · 2 months ago

I don’t have any more info on it, but I can prove it

Mnemnosyne@sh.itjust.works · 2 months ago

I’m kind of curious to understand how they’re blocking other search engines. I was under the impression that search engines just viewed the same pages we do to search through, and the only way to ‘hide’ things from them was to not have them publicly available. Is this something that other search engines could choose to circumvent if they decided to?

Madis@lemm.ee · 2 months ago

Search engine crawlers identify themselves (user agents), so they can be kept out both by an honor-based system (robots.txt) and by active blocking (a 403 error or similar) when they attempt to access the site.
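Both mechanisms can be sketched in Python. The robots.txt content below is hypothetical, not Reddit's actual file, and `server_status` is an invented stand-in for whatever a real server does; only `urllib.robotparser` is a real standard-library module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is allowed, everyone else is not.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=HYPOTHETICAL_ROBOTS_TXT):
    """Honor system: what a well-behaved crawler concludes from robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def server_status(user_agent, allowed_agents=("Googlebot",)):
    """Active blocking: the server itself answers 403 to unlisted user agents."""
    listed = any(name.lower() in user_agent.lower() for name in allowed_agents)
    return 200 if listed else 403


if __name__ == "__main__":
    url = "https://www.example.com/r/some_subreddit/"
    for agent in ("Googlebot", "DuckDuckBot", "bingbot"):
        print(agent, is_allowed(agent, url), server_status(agent))
```

Note that both checks rely on the user agent string, which the crawler reports about itself.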

Mnemnosyne@sh.itjust.works · 2 months ago

Thank you, I understand better now. So in theory, if one of the other search engines chose to not have their crawler identify itself, it would be more difficult for them to be blocked.

tb_@lemmy.world · 2 months ago

This is where you get into the whole webscraping debate you also have with LLM “datasets”.

If you, as a website host, are detecting a ton of requests coming from a singular IP you can block said address. There are ways around that by making the requests from different IP addresses, but there are other ways to detect that too!
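As a rough sketch of the "ton of requests from a singular IP" check, here is a sliding-window counter in Python. The class name, the limit, and the window are all made up for illustration; real sites typically combine this with many other signals.

```python
from collections import defaultdict, deque


class RequestRateMonitor:
    """Flag an IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, timestamp):
        """Record one request; return True if this IP is now over the limit."""
        history = self._history[ip]
        history.append(timestamp)
        # Forget requests that have fallen out of the time window.
        while history and history[0] <= timestamp - self.window_seconds:
            history.popleft()
        return len(history) > self.max_requests


if __name__ == "__main__":
    monitor = RequestRateMonitor(max_requests=3, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 3.0, 20.0):
        print(t, monitor.record("203.0.113.7", t))
```

Spreading requests across many addresses defeats this particular check, which is why it is only one of several detection methods.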

I’m not sure if Reddit would try to sue Microsoft or DDG if they started serving results anyway through such methods. I don’t believe it is explicitly disallowed.
But if you were hoping to deal in any way with Reddit in the future I doubt a move like this would get you in their good graces.

All that is to say; I won’t visit Reddit at all anymore now that their results won’t even show up when I search for something. This is a terrible move and will likely fracture the internet even more as other websites may look to replicate this additional source of revenue.

Azzu@lemm.ee · 2 months ago

I wish Lemmy were more searchable. The search function actually works decently well, but it’s not on the same level as actual search engines: it doesn’t seem to look for related/similar terms, and the relevancy ranking doesn’t seem right either.

gedaliyah@lemmy.world · 2 months ago

I do occasionally find Lemmy in web search results. The platform is not that big (or old), but as long as it sticks around then eventually searchability will improve.

KroninJ@lemmy.world · 2 months ago

It’s still possible to search with “site:reddit.com …”

Has it been implemented yet or are they blocking non-flagged searches? Which seems odd.

tb_@lemmy.world · 2 months ago

You shouldn’t be getting any new results if you do that; older posts will/may remain indexed.

KroninJ@lemmy.world · 2 months ago

Aha. I was wondering about that possibility.

Babalugats@lemmy.world · 2 months ago

They’re also blocking posts by users who aren’t banned or even got a warning. It appears to the user as though it’s been posted, but it hasn’t.

eee@lemm.ee · 2 months ago

shadowbanning is a totally different issue that’s existed for a long time though.

Kit@lemmy.blahaj.zone · 2 months ago

Shadowbanning? Do you have more info on this?

Babalugats@lemmy.world · 2 months ago

I didn’t know there was a name for it. I don’t have any more info on it, but I can show examples of it happening.

WolfLink@sh.itjust.works · edit-2 2 months ago

They’ve done this for a long time. It’s supposedly only meant to be used on bots, but it definitely isn’t in practice

Babalugats@lemmy.world · 2 months ago

It definitely is in practice 100%

Jimmybander@champserver.net · 2 months ago

Block Reddit!

Plopp@lemmy.world · 2 months ago

But muh porn!

Jimmybander@champserver.net · 2 months ago

Exactly. You’re addicted, Plopp.
