LJ, blog searches, datamining
Sep. 15th, 2005 12:01 pm
Google's new blog search is pretty nifty if you either like searching through people's weblogs or are an egotist who likes to kiboze. I'm both. Since I've always been a shameless self-promoter and I ping all available services, index myself in search engines, etc., this is just peachy.
The way LJ did it was to provide a large-scale XML data feed of Livejournal and Typepad blogs. The feed is explicitly intended for use by larger organizations who want to resyndicate or index this huge quantity of data. It's not usable by end users; it's an institutional service.
This is great if you're Google, or AOL, or an MIT grad student doing a thesis on weblogging. However, if you're an LJ user who checked the "please do not let search engines index me" button, it may be an unwelcome surprise. People who assumed a level of public presence that included friends and internet acquaintances, but not every coworker or family member who Googled them, have now discovered that the verb "to Google" includes a well-indexed stream of all their public entries since March.
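To make concrete what "institutional feed" means in practice, here is a minimal sketch of a consumer, assuming a hypothetical endpoint and a simple post-per-element XML format; the real feed's URL and schema aren't given in this post, so every name below is an assumption. The point is the scale: one HTTP fetch replaces crawling thousands of individual journals.

```python
# Hypothetical sketch of a feed consumer. FEED_URL and the element names
# ("post", "journal", "title", "body") are assumptions, not the real
# feed's documented format.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://example.com/latest-posts.xml"  # placeholder endpoint

def fetch_posts(url=FEED_URL):
    """Download one batch of the feed and yield (journal, title, body)."""
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    for post in tree.iter("post"):
        yield (
            post.findtext("journal", default=""),
            post.findtext("title", default=""),
            post.findtext("body", default=""),
        )

if __name__ == "__main__":
    for journal, title, body in fetch_posts():
        print(journal, "-", title)
```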
I had a frustrating conversation about this with mendel yesterday (sorry I got ruffled there, Rich), in which I think we were both right about different things. He quite rightly pointed out that public LJ entries were subject to data mining and indexing in a number of ways already, and that the check box for blocking robots did not imply privacy to someone who understands the current state of the Internet. Certainly my personal expectation is that anything I post, even with the lock on it, could conceivably end up as the lead story on CNN, and I proceed with that risk in mind.
And of course many of the complaints received by Six Apart about this will be from people who are misinformed about technology, or the law in various countries, or any number of complicated issues. I actually have no idea what U.S. law would say about what a customer can reasonably expect in this situation, and since the technologies involved are about fifteen minutes old, it may be unknown anyway.
My concern was different. Providing a massive datastream useful only to large-scale operations is qualitatively different from merely allowing spidering. Marketers, credit agencies, insurance companies, and government agencies now have an optimized tool for data mining a huge chunk of weblogs. The amount of effort required to monitor and index all of LJ and Typepad just deflated tremendously.
I am reminded, for example, of FedEx providing a stream of their tracking information to the U.S. Department of Homeland Security, or of the supermarket loyalty card information being informally turned over to the government right after 9/11/01. A recent event I posted about in which auto repair records from dealers were aggregated and sold to Carfax comes to mind. I have been told by people in the email appliance business that spammers derive a good chunk of income these days by selling verified email addresses with names attached to insurers and credit reporting agencies as additional identifying information for their records ("appends").
In short, Database Nation (Amazon link). To my mind these changes are inevitable, irresistible, and both exciting and frightening for different reasons.
But I also think that Six Apart failed their customers, at least in the customer satisfaction/PR department, by not providing a pre-launch opt-out or removing customers who checked that box from their institutional feed.
(no subject)
Date: 2005-09-15 07:14 pm (UTC)

Granted, robots.txt is only a suggestion. But most spiders are ethical enough to abide by it.
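(For reference, this is all the "ethics" amounts to: a crawler voluntarily checking robots.txt before fetching, as in this Python sketch. The URLs and bot name are placeholders, and nothing enforces the answer.)

```python
# A well-behaved spider's voluntary check against robots.txt, using the
# Python standard library. Nothing stops a crawler from skipping this step.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://www.livejournal.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

url = "http://www.livejournal.com/users/someuser/"
if rp.can_fetch("ExampleBot/1.0", url):  # placeholder user agent
    print("robots.txt permits fetching", url)
else:
    print("robots.txt asks us not to fetch", url)
```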
(no subject)
Date: 2005-09-15 08:26 pm (UTC)

And the answer, as it turns out, is just a bit down the page.
(no subject)
Date: 2005-09-15 08:27 pm (UTC)

- Publish feeds: I can think of good reasons why someone would make this choice.
- Make my journal searchable (no matter how that's accessed)

All four combinations of these two features make sense.

(no subject)
Date: 2005-09-15 08:00 pm (UTC)

Still, I'm of the mindset that everything I type here is public, and I have a strong grasp of the technology involved; I'm not the kind of person who would be complaining to Six Apart. It is hard for me to imagine that there are people dumb enough not to know that everything they type on a public webpage on the internet is locatable by Google (or other search engines), but I guess they *do* exist somewhere.

Allowing people to opt out of the feed really isn't a solution--in much the same way as XOR "encryption" or security through obscurity is not a solution. The data is still there, public, and easily obtainable by someone with bandwidth. Other, non-Google search engines that spider the site directly instead of using the feed will catch those entries. If the data were more valuable, a sort of gray-market rift would be created where official, fine, upstanding search engines would be missing data that more down-and-dirty spidering search engines would get--but we're talking about blogs here, so nobody cares enough.
The only real solution is to let people delete their accounts, wait until Google catches up to the deletion (which, I imagine, is faster now that they have the feed), and then forget that things like archive.org's Internet Wayback Machine exist.
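(To make the XOR aside above concrete, a tiny Python illustration: XOR is its own inverse, so anyone holding the output and the key, or willing to guess the key, has the data.)

```python
# XOR "encryption" undoes itself: the second application restores the input.
def xor_bytes(data: bytes, key: int) -> bytes:
    return bytes(b ^ key for b in data)

secret = xor_bytes(b"public blog entry", 0x2A)
print(secret)                   # scrambled-looking bytes
print(xor_bytes(secret, 0x2A))  # b'public blog entry' again
```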
Yes.
Date: 2005-09-15 08:10 pm (UTC)

I still think that introducing a feature that dramatically reduces the cost and effort of datamining the blog stream, without offering people a pre-launch opt-out, is a mistake. I obviously don't think that the NSA is honoring robots.txt or that they won't take the trouble to spider all of LJ, but having pretty much every marketing company with a few grand to spend on infrastructure snarf up the stream and munch happily on it is a bit too Panopticon for me.
(no subject)
Date: 2005-09-15 08:37 pm (UTC)

There IS an option on the Google blog search page to prevent their bots from indexing your pages, but I have no idea where to paste the code. Plus, it will apparently take some time for the entries to disappear.
I think Six Apart REALLY did some of us a disservice with this one... no warning or anything?
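(My guess is that the "code" Google's page hands out is a robots meta tag for the page's `<head>`; I haven't verified that, so treat it as an assumption. Here's a Python sketch of how an indexer that honors the tag would detect it. Like robots.txt, it only works if the indexer chooses to look.)

```python
# Sketch: detecting a per-page <meta name="robots" content="noindex"> tag,
# the standard page-level opt-out an indexer may choose to honor.
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            if "noindex" in (a.get("content") or "").lower():
                self.noindex = True

page = '<head><meta name="robots" content="noindex, nofollow"></head>'
finder = RobotsMetaFinder()
finder.feed(page)
print("skip this page" if finder.noindex else "index this page")
```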
(no subject)
Date: 2005-09-15 08:45 pm (UTC)

See elsewhere in this thread for the admin console command to turn off your participation in the feed, though! :)
(no subject)
Date: 2005-09-15 08:49 pm (UTC)

I know it's foolish to expect that such things won't happen on the internet, but you know... some folks tend to freeze the internet and technology at certain years.
(no subject)
Date: 2005-09-15 10:33 pm (UTC)

Currently, the only available opt-out is `set synlevel level` from the console, or going friends-only. Back in April and May, I chatted with

Since people have brought this up again, I'm halfway through writing a spec for an X-Robots header and an XML-attribute approach. The former moves it to the HTTP level, so it works for anything sent over HTTP, and the latter provides finer (element-level) granularity.
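(As a sketch of the HTTP-level half of that idea: the server attaches the directive to the response itself, so it covers anything served over HTTP, not just HTML. The header name and value come from the in-progress spec described above, so they're provisional.)

```python
# Provisional sketch: serving content with an "X-Robots"-style header, per
# the in-progress spec mentioned above. Header name/value are not final.
from http.server import BaseHTTPRequestHandler, HTTPServer

class OptedOutHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"a public entry whose author opted out of indexing\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("X-Robots", "noindex")  # provisional directive
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), OptedOutHandler).serve_forever()
```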
(no subject)
Date: 2005-09-17 01:00 am (UTC)

I have an update from Google's BlogSearch team over here (http://www.livejournal.com/users/brevity/327484.html).