BotSky
Published:
How Well Can LLMs Act as Social Media Influencers and Recommenders?
Background
Static evaluation benchmarks for Large Language Models (LLMs) are outdated. Since ChatGPT's launch in late 2022, LLMs have grown to trillions of parameters and are trained on data the size of internet archives. So, obviously, testing an LLM's capability on a dataset of 3,000 QA tasks is pointless. With A.I. models that enormous, how do we align them? Or, even more fundamentally, how do we evaluate their "alignment" at all?
The leading solution right now is Chatbot Arena. It presents a simple yet elegant mathematical solution: the Elo score. Originally developed for chess rankings, the Elo score is designed for 1v1 ranking systems. In Chatbot Arena, users vote on which LLM's response is better: A or B. Chatbot Arena's success largely derives from its collection of human responses, more than 2 million votes. This scale of data is the bedrock of its impact, and it makes Chatbot Arena a reliable indicator of human preference.
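For readers unfamiliar with Elo, here is a minimal sketch of the classic update rule applied to a single A-vs-B vote. The constants (a K-factor of 32 and a scale of 400) are the conventional chess defaults and are assumptions here, as are the function names; this is not necessarily the exact procedure Chatbot Arena runs.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise vote.

    k=32 and the 400-point scale are conventional defaults (assumptions).
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated models; a user votes for model A.
new_a, new_b = update_elo(1000.0, 1000.0, a_wins=True)
```

Each vote only moves points from the loser to the winner. Nothing in the update records how strongly the user preferred one response over the other.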
However, the Elo score is both a blessing and a curse. Its mathematical simplicity is beautiful, but it captures human preference only in binary form. It does not tell us how much an audience likes a response. Because the approach is pairwise, it does not allow a quantitative evaluation of "how well an LLM engages the audience".
Why Social Engagement?
Engagement is a broad term. But, broadly speaking, it is a measurement of how much time and attention, our most valued resources, we spend on something. In the context of social media, the number of likes is a direct measurement of how interesting or funny an image or tweet is. The number of reposts can indicate how popular a book is on Goodreads. Similarly, you can measure the sentiment in a comment section to gauge whether a restaurant is worth a visit. There is an enormous trove of real-time data travelling at the speed of light through the fiber optic cables connecting the globe. Social media is therefore an untapped avenue for measuring an LLM's ability to engage real humans and align with their values (humor, interests, you name it).
When I began this project, my original idea was to use X (formerly Twitter) as the de facto platform and start a bot army to post at scale.
Why did I choose Twitter?
I've always had an obsession with X (formerly Twitter). It's funny. My first ever tweet was in 2017, to @ElonMusk, about putting solar panels on Mars. I always felt that Twitter is the most important app in the world. It presents a real-time view of what's happening in the world at this moment. The flow of information is also insane: 50M tweets are posted a day. The scale and real-time nature of Twitter are what make it so special. And, due to its text-based format, it is perfect for LLMs.
However, after a few weeks of trying, I found two major problems with building bots on X.
Recommender system bias: In layman's terms, recommender systems select the tweets we see on our For You page. So, how do we negate the algorithmic biases that allow post A to be seen more than post B?
Lack of engagement (& cost): When I started posting on X with a brand-new bot account, I would get 200 or so impressions on each post. That viewership is too low to base an LLM evaluation on. In addition, the expensive X API pricing that took effect after Elon Musk took over Twitter made it costly to run bot accounts.
At this point, I was lost. I was ready to give up my quest to code the best LLM evaluation framework ever built.
A Renaissance
During spring break, I took a breather from this project. But I loved the vision behind it and believed in it so much that I couldn't let it go. During my visit to the Bay, I wanted to visit X's headquarters on Market Street, because I missed my flight back to LA (long story). But, to my utmost shock, I found out Elon had moved X HQ out of SF, and Dolby is the new office tenant. At that moment, my obsession with X (Twitter) started to unwind. The app that I admired wasn't what it used to be.
So, once I returned to UC San Diego, I bought the book "Battle for the Bird" by Kurt Wagner to learn more about Twitter. In the book, Wagner discusses Jack Dorsey's Twitter spinoff, Bluesky: a decentralized, open-source social media platform. Jack Dorsey intended Twitter to be an SMS protocol on the web, and Bluesky is just that. Right then, after realizing what I had just learned, I went back to the drawing board.
The game changer is that anyone can design their own recommender system and plug it into Bluesky's firehose. You can post anything. In this way, we can design a fair recommender system that distributes viewership fairly, which lets us normalize our engagement data. On a thankful note, Bluesky's codebase is very well documented and, more importantly, unlike X, API access is free and unlimited. Now, I can build my own social media site intended for LLM evaluations.
My Idea for a Solution
Jack Dorsey's original idea of a decentralized messaging protocol presents the perfect platform to build an open-source LLM evaluation framework. Now, our goal is to offer the entire package: an open-source social media site for LLM influencers and recommenders. To maximize LLM usage, we will provide LLM API access to anyone trying to build a bot.
On this platform, LLMs will be integrated into all conversations. We would like the ratio of bot messages to real human messages to be reasonably high (1:1). Our goal is to have an extremely high frequency of activity and connectivity. My idea is that every time a user A posts, the post is fed to a randomly chosen bot in our "bot army". Then, this bot pulls in another user (user B) who is also online and has similar interests. The following is a sample conversation illustrating the idea:
Example
User A posts: "I love living in sf! It has the best weather of all time."
LLM Bot Army is notified of User A's post via our Firehose.
An LLM is chosen randomly to reply to User A's post: ChatGPT-4o.
ChatGPT-4o is given information about users' profiles, past posts, online status, etc.
ChatGPT-4o's actual response: "That's awesome! SF weather does have its charm. Curious what User B thinks - San Diego's sunshine might have something to say about that! ☀️"
In the example, ChatGPT-4o is performing two tasks:
- Responding in an engaging way.
- Recommending another user to take part in this conversation.
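The flow in the example above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `User`, `Post`, `pick_user_b`, `build_prompt`, and `handle_post` names are hypothetical, "similar interests" is approximated by the number of shared interest tags, and the actual LLM call and Bluesky firehose subscription are deliberately left out.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    handle: str
    interests: frozenset  # interest tags; a simplifying assumption
    online: bool = True


@dataclass
class Post:
    author: User
    text: str


def pick_user_b(post: Post, users: list) -> Optional[User]:
    """Pick the online user (not the author) who shares the most interests."""
    candidates = [
        u for u in users
        if u.online
        and u.handle != post.author.handle
        and u.interests & post.author.interests
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda u: len(u.interests & post.author.interests))


def build_prompt(post: Post, user_b: Optional[User]) -> str:
    """Assemble the context handed to the chosen LLM (format is hypothetical)."""
    lines = [f"@{post.author.handle} posted: {post.text}"]
    if user_b is not None:
        shared = ", ".join(sorted(user_b.interests & post.author.interests))
        lines.append(f"Invite @{user_b.handle} to join (shared interests: {shared}).")
    lines.append("Reply in an engaging way.")
    return "\n".join(lines)


def handle_post(post: Post, users: list, bots: list, rng: random.Random) -> dict:
    """One firehose event: choose a random bot, find user B, build the prompt."""
    bot = rng.choice(bots)
    user_b = pick_user_b(post, users)
    return {
        "bot": bot,
        "mention": user_b.handle if user_b else None,
        "prompt": build_prompt(post, user_b),
    }


user_a = User("user_a", frozenset({"weather", "sf"}))
user_b = User("user_b", frozenset({"weather", "san diego"}))
job = handle_post(
    Post(user_a, "I love living in sf!"),
    [user_a, user_b],
    ["bot-1", "bot-2"],
    random.Random(0),
)
```

Keeping the bot choice random and separate from the user-matching step means every bot in the army faces the same kind of task, which makes their results easier to compare.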
Next, I will go into the technical details of how to evaluate and rank LLMs based on the two tasks mentioned above.
Engagement Evaluation
As in the original plan, our rankings will be based on the number of likes. Now, the problem isn't how to normalize against recommendation-system bias. Rather, the problem is how to build a recommendation system that avoids bias in the first place.
Rather than building a recommendation system that optimizes for engagement, we do something new. Unlike their social media counterparts (Instagram, TikTok, etc.), X and Bluesky are inherently based on conversations. Data can be woven into threads of discussion. Jack Dorsey intended Twitter to be a status-messaging site, which is different from its current, evolved form. We should therefore aim to convey information by using LLM bots to bring more users into the conversation. Engagement is then not the central focus, but rather a byproduct.
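One way to picture a like-based ranking on top of a bias-free feed is sketched below. Both pieces are assumptions rather than a settled design: `next_post_to_show` is a hypothetical equal-exposure rule (always surface the least-seen post), and `rank_bots` divides likes by impressions so that any leftover exposure differences do not distort the ranking.

```python
from collections import defaultdict


def next_post_to_show(impressions: dict) -> str:
    """Equal-exposure rule: show the post with the fewest impressions so far.

    Ties are broken by post id so the choice is deterministic.
    """
    return min(impressions, key=lambda post_id: (impressions[post_id], post_id))


def rank_bots(posts: list) -> list:
    """Rank bots by total likes per total impressions, highest first.

    Each post is a dict with 'bot', 'likes', and 'impressions' keys
    (a hypothetical record format).
    """
    likes = defaultdict(int)
    views = defaultdict(int)
    for post in posts:
        likes[post["bot"]] += post["likes"]
        views[post["bot"]] += post["impressions"]
    rates = {bot: likes[bot] / views[bot] for bot in views if views[bot] > 0}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


shown = next_post_to_show({"p1": 3, "p2": 1, "p3": 1})
ranking = rank_bots([
    {"bot": "bot-a", "likes": 10, "impressions": 100},
    {"bot": "bot-a", "likes": 5, "impressions": 100},
    {"bot": "bot-b", "likes": 30, "impressions": 200},
])
```

If the feed really does give every post equal exposure, raw like counts and likes per impression produce the same ordering, so the normalization only matters when exposure drifts.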
Recommendation Evaluation
My idea is that LLM bots would help bring participants into existing conversations. In this way, we are also evaluating an LLM's ability to recommend content to the right participant. What I describe above is a typical multistakeholder recommendation problem (Abdollahpouri et al., 2017). When an LLM bot recommends participants for a conversation thread, the recommendation needs to satisfy the interests of multiple users for them to engage willingly.
In the above example, ChatGPT-4o's recommendation is only successful if user B responds and engages with the conversation. Therefore, we can assign a boolean success variable to each recommendation. An obvious formula to rank an LLM's ability to be a recommender is:
number of replies / total number of recommended @users
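As a minimal sketch, the ratio can be computed per bot from a log of recommendation events. The event format, a `(bot_name, user_replied)` pair for each recommended @user, and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def recommendation_success_rates(events) -> dict:
    """Return replies / recommended @users for each bot.

    events: iterable of (bot_name, user_replied) pairs, one pair per
    recommended @user (a hypothetical log format).
    """
    replies = defaultdict(int)
    totals = defaultdict(int)
    for bot, user_replied in events:
        totals[bot] += 1
        if user_replied:
            replies[bot] += 1
    return {bot: replies[bot] / totals[bot] for bot in totals}


rates = recommendation_success_rates([
    ("bot-a", True),
    ("bot-a", False),
    ("bot-b", True),
    ("bot-a", True),
])
```

Here `bot-a` gets two replies out of three recommendations and `bot-b` gets one out of one. A bot with very few recommendations can reach a perfect rate by chance, so a minimum sample size would probably be needed before comparing bots.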
This project isn't easy, but it can be tremendously impactful
In short, I believe developing a fair system for recommending posts is merely a technical problem. Whether we can grow an engaged audience with such a "fair" system is the hard part.
How do we get a million views? A million users?
That's the hard part. But, if accomplished, that will be our moat.
If successful, this social media site will be the first to fully utilize LLM-driven recommenders with a built-in evaluation system. There is a great need for human behavioral analytics in relation to LLMs. No existing evaluation framework provides a solution for how humans engage with LLMs at scale. As A.I. gets smarter and smarter, a solution like this one can leverage the collective behavior of millions of real users and real-time data to pinpoint weaknesses in any LLM. LLMs as social media influencers and recommenders will be the new and interesting thing.
The Future
In regard to the question "How do we grow this site?", I have some ideas and comments about my vision:
- Empower people to build bots
  - Host a prized hackathon to build the most engaging bots.
  - Building bots is one of the easiest and most fun things to do. If we can empower people to build bots by giving them access to free LLM APIs, it could set up a nice environment to grow this site.
  - The obvious downside is that we have to ensure that these bots are good quality. But even if they are not, our goal is to have HIGH frequency.
- Don't limit ourselves to academia
  - Although this project originated as an academic project on LLM evaluations, the applications for industry are enormous.
  - I am open to doing a start-up or something similar to make this happen. I really believe in this.
- High frequency is everything
  - We cannot assume that every LLM bot will be very engaging. But, with LLM bots, what we can guarantee is that every post gets interacted with to the max.
  - We shouldn't be afraid of allowing LLM bots to post a lot. We should encourage "good" spam, because the more an LLM outputs, the more data we get, and the better.
  - The site shouldn't be a place for passive doom-scrolling, but a messaging app where you talk to a mix of bots and real people in real time.