BotSky


Published: April 14, 2025

Background

Static evaluation benchmarks for Large Language Models (LLMs) are outdated. Since ChatGPT’s launch in late 2022, LLMs have grown to trillions of parameters, swallowing training data on the scale of internet archives. So, obviously, testing an LLM’s capability on a dataset of 3,000 QA tasks is pointless. With A.I. models that enormous, how do we align them? Or, even more fundamentally, how do we evaluate their “alignment” at all?

The leading solution right now is Chatbot Arena. It presents a simple yet elegant mathematical solution: the Elo score. Originally based on chess rankings, the Elo score is used in a 1v1 ranking system. In the case of Chatbot Arena, users vote on which LLM’s response is better - A or B. Chatbot Arena’s success largely derives from its collection of human votes - more than 2 million of them. This scale of data is the bedrock of its impact, and it is what makes Chatbot Arena a reliable indicator of human preference.
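To make the 1v1 mechanics concrete, here is a minimal sketch of the classic chess-style Elo update applied to a single A-vs-B vote. The function names, the K-factor of 32, and the 400-point scale are the conventional textbook choices and are assumptions here; this is not a description of how Chatbot Arena computes its leaderboard.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise vote.

    k is an assumed K-factor; 32 is a common default in chess.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated models; the voter prefers A.
print(elo_update(1000.0, 1000.0, a_wins=True))  # → (1016.0, 984.0)
```

Note that the only input from the human is the binary `a_wins` flag: the update has no way to express how much better one response was.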

However, the Elo score is both a blessing and a curse. The beauty of its mathematical simplicity is that it captures human preference in binary form. But it does not tell us how much an audience likes a response. Because the approach is pairwise, it does not allow a quantitative evaluation of “how well an LLM engages the audience”.

Engagement is a broad term. But, broadly speaking, it is a measurement of how much time and attention - our most valued resources - we spend on something. In the context of social media, the number of likes is a direct measurement of how interesting or funny an image or tweet is. The number of reposts can also be a measurement of how popular a book is on Goodreads. Similarly, you can measure the sentiment in the comment section to gauge whether a restaurant is worth a visit. There is an enormous trove of real-time data travelling at the speed of light through the fiber optic cables connecting the globe. Therefore, social media is an untapped avenue for measuring an LLM’s ability to engage real humans and align with their values (humor, interests, you name it).

When I began this project, my original idea was to use X (formerly Twitter) as our de facto platform and start a bot army to post at scale.

Why did I choose Twitter?

I’ve always had an obsession with X (formerly Twitter). It’s funny. My first ever tweet was in 2017 to @ElonMusk about putting solar panels on Mars. I always felt that Twitter is the most important app in the world. It presents a real-time view of what’s happening in the world at this moment. The flow of information is also insane - 50M tweets are posted a day. The scale and real-time nature of Twitter are what make it so special. And, due to its text-based format, it is perfect for LLMs.

However, after a few weeks of trying, I ran into 2 major problems with building bots on X.

Recommender System bias: In layman’s terms, recommender systems select the tweets we see on our For You page. So, how do we negate the algorithmic biases that allow post A to be seen more than post B?
Lack of Engagement (& cost): When I started posting on X with a brand new bot account, I would get 200 or so impressions on each post. That viewership is too low to base an LLM evaluation on. In addition, the expensive X API pricing that took effect after Elon Musk took over Twitter made it costly to run bot accounts.

At this point, I was lost. I was ready to give up my quest to code the best LLM evaluation framework ever built.

A Renaissance

During spring break, I took a breather from this project. But I loved the vision behind it and believed in it so much that I couldn’t let it go. During my visit to the Bay, I missed my flight back to LA (long story), so I decided to visit X’s headquarters on Market Street. But, to my utmost shock, I found out Elon had moved X HQ out of SF - and Dolby is the new office tenant. At that moment, my obsession with X (Twitter) started to unwind. The app that I admired wasn’t what it used to be.

So, once I returned to UC San Diego, I bought the book “Battle for the Bird” by Kurt Wagner to learn more about Twitter. In the book, Wagner discusses Jack Dorsey’s Twitter spinoff - Bluesky - a decentralized, open-source social media platform. Jack Dorsey intended Twitter to be an SMS protocol on the web. And Bluesky is just that. Right then, after realizing what I had just learned, I went back to the drawing board.

The game changer is that anyone can design their own recommender system and plug it into Bluesky’s firehose. You can post anything. In this way, we can design a fair recommender system that distributes viewership evenly, which lets us normalize our engagement data. Thankfully, Bluesky’s codebase is very well documented and, more importantly, unlike X, API access is free and unlimited. Now, I can build my own social media site intended for LLM evaluations.
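As a rough illustration of what “fairly distributing viewership” could mean, here is a minimal sketch of a feed that always shows the post from whichever bot has been shown least so far. The `FairFeed` class, its method names, and the least-exposed-first rule are all assumptions made for illustration; the sketch does not call any Bluesky API and is only one of many possible fairness rules.

```python
from collections import defaultdict


class FairFeed:
    """Hypothetical feed that equalizes impressions across bots."""

    def __init__(self):
        # bot_id -> number of times one of its posts has been shown
        self.impressions = defaultdict(int)

    def pick(self, candidates):
        """candidates: list of (post_id, bot_id) pairs.

        Returns the post_id whose author has the fewest impressions so far.
        Ties go to the earliest candidate in the list.
        """
        post_id, bot_id = min(candidates, key=lambda c: self.impressions[c[1]])
        self.impressions[bot_id] += 1
        return post_id


feed = FairFeed()
candidates = [("post_1", "bot_a"), ("post_2", "bot_b")]
shown = [feed.pick(candidates) for _ in range(4)]
print(shown)  # → ['post_1', 'post_2', 'post_1', 'post_2']
```

With exposure held equal like this, a difference in likes between two bots is easier to attribute to the posts themselves rather than to how often each was shown.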

AT Protocol

My Idea for a Solution

Jack Dorsey’s original idea of a decentralized messaging protocol presents the perfect platform for building an open-source LLM evaluation framework. Our goal is to offer the entire package: an open-source social media site for LLM influencers and recommenders. To maximize LLM usage, we will provide LLM API access to anyone trying to build a bot.

On this platform, LLMs will be integrated into all conversations. We would like the ratio of bot messages to real human messages to be reasonably high (1:1). Our goal is to have an extremely high frequency of activity - and connectivity. My idea is that every time user A posts, the post is fed to a randomly chosen bot in our “bot army”. Then, this bot pulls in another user (user B) who is also online and has similar interests. The following is a sample conversation illustrating my idea:

Example

User A posts: “I love living in sf! It has the best weather of all time.”

LLM Bot Army is notified of User A’s post via our Firehose.

An LLM is chosen randomly to reply to User A’s post - ChatGPT-4o.

ChatGPT-4o is given information about users’ profiles, past posts, online status, etc.

ChatGPT-4o’s actual response: “That’s awesome! SF weather does have its charm 🌁. Curious what User B thinks—San Diego’s sunshine might have something to say about that! ☀️😎”

In the example, ChatGPT-4o is performing 2 tasks:

Responding in an engaging way.
Recommending another user to take part in this conversation.
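The flow in the example can be sketched roughly as follows. Everything here is hypothetical: the `User` dataclass, the Jaccard overlap used as a stand-in for “similar interests”, and the prompt wording are assumptions for illustration, and the actual LLM call is left out.

```python
import random
from dataclasses import dataclass, field


@dataclass
class User:
    handle: str
    interests: set = field(default_factory=set)
    online: bool = True


def interest_overlap(a: set, b: set) -> float:
    """Jaccard similarity between two interest sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_bot(bots: list, rng: random.Random) -> str:
    """Choose one bot uniformly at random from the bot army."""
    return rng.choice(bots)


def pick_user_b(author: User, users: list):
    """Return the online user (other than the author) with the most similar
    interests, or None if nobody online shares any interest."""
    candidates = [u for u in users if u.online and u.handle != author.handle]
    if not candidates:
        return None
    best = max(candidates, key=lambda u: interest_overlap(author.interests, u.interests))
    return best if interest_overlap(author.interests, best.interests) > 0 else None


def build_prompt(post: str, author: User, user_b) -> str:
    """Assemble the text handed to the chosen LLM (assumed wording)."""
    lines = [f"@{author.handle} posted: {post}", "Reply in an engaging way."]
    if user_b is not None:
        lines.append(f"Invite @{user_b.handle} into the conversation.")
    return "\n".join(lines)


user_a = User("user_a", {"weather", "cities"})
others = [User("user_b", {"weather", "cities", "surfing"}), User("user_c", {"chess"})]
bot = pick_bot(["bot_one", "bot_two"], random.Random())
print(bot, "receives:")
print(build_prompt("I love living in sf!", user_a, pick_user_b(user_a, others)))
```

The two tasks map onto the two instructions in the prompt: the reply itself, and the choice of whom to invite.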

Next, I will go into the technical details of how to evaluate and rank LLMs based on the two tasks mentioned above.

Engagement Evaluation

As in the original plan, our rankings will be based on the number of likes. Now, the problem isn’t how to normalize against recommendation system bias. Rather, it is how to build a recommendation system that avoids bias in the first place.

Rather than building a recommendation system to optimize for engagement, we do something new. Unlike their social media counterparts (Instagram, TikTok, etc.), X and Bluesky are inherently based on conversations. Data can be woven into threads of discussion. Indeed, unlike its current evolved form, Twitter was intended by Jack Dorsey to be a status messaging site. Therefore, we should aim to convey information by using LLM bots to add more users to the conversation, where engagement is not the central focus but rather a byproduct.

Recommendation Evaluation

My idea is that LLM bots would help bring participants into existing conversations. In this way, we are also evaluating an LLM’s ability to recommend content to the right participant. What I describe above is a typical multistakeholder recommendation problem (Abdollahpouri et al., 2017). When an LLM bot recommends participants for a conversation thread, the recommendation needs to satisfy the interests of multiple users for them to willingly engage.

In the above example, ChatGPT-4o’s recommendation is only successful if user B responds and engages with the conversation. Therefore, we can assign a boolean variable to the success of each recommendation. An obvious formula to rank an LLM’s ability to be a recommender is:

number of replies / total number of recommended @users
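A minimal sketch of this ratio in code, assuming each recommendation is logged as a `(bot_name, replied)` pair where `replied` is the boolean described above. The event format, function names, and bot names are hypothetical.

```python
from collections import defaultdict


def recommendation_success_rate(events):
    """events: iterable of (bot_name, replied) pairs, one per recommended @user.

    Returns {bot_name: number of replies / total number of recommended @users}.
    """
    replies = defaultdict(int)
    total = defaultdict(int)
    for bot, replied in events:
        total[bot] += 1
        replies[bot] += int(replied)
    return {bot: replies[bot] / total[bot] for bot in total}


def rank_bots(events):
    """Bot names sorted from highest to lowest success rate."""
    rates = recommendation_success_rate(events)
    return sorted(rates, key=rates.get, reverse=True)


events = [("bot_a", True), ("bot_a", False), ("bot_b", True)]
print(recommendation_success_rate(events))  # → {'bot_a': 0.5, 'bot_b': 1.0}
print(rank_bots(events))                    # → ['bot_b', 'bot_a']
```

A bot with very few recommendations can top this ranking by luck, so in practice some minimum sample size per bot would probably be needed before comparing rates.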

This project isn’t easy, but can be tremendously impactful

In short, I believe developing a fair system for recommending posts is merely a technical challenge. However, whether we can grow an engaged audience with such a “fair” system is the hard part.

How do we get a million views? A million users?

That’s the hard part. But, if accomplished, that will be our moat.

If successful, this social media site will be the first to fully utilize LLM-driven recommenders with a built-in evaluation system. There is a great need for human behavioral analytics in relation to LLMs. No existing evaluation framework provides a way to measure how humans engage with LLMs at scale. As A.I. gets smarter and smarter, a solution like this one can leverage the collective behavior of millions of real users and real-time data to pinpoint weaknesses in any LLM. LLMs as social media influencers and recommenders will be the new and interesting thing.

The Future

Regarding the question “How do we grow this site?”, I have some ideas and comments about my vision:

Empower people to build bots
- Host a prized hackathon to build the most engaging bots
- Building bots is one of the easiest and most fun things to do. If we can empower people to build bots by providing them access to free LLM APIs, it could set up a nice environment to grow this site.
- The obvious downside is that we have to ensure that these bots are good quality. But even if they are not, our goal is to have HIGH frequency.
Don’t limit ourselves to academia.
- Although this project originated as an academic project - LLM evaluations - the applications for industry are enormous.
- I am open to doing a start-up or something similar to make this happen. I really believe in this.
High Frequency is everything
- We cannot assume that every LLM bot will be very engaging. But, with LLM bots, what we can guarantee is that every post gets interacted with to the max.
- We shouldn’t be afraid of allowing LLM bots to post a lot. We should encourage “good” spam, because the more an LLM outputs, the more data we get, and the better.
- The site shouldn’t be a passive doom-scrolling feed, but a messaging app where you talk to a mix of bots and real people in real time.


Jason Kong
