In 2017, a group of researchers at OpenAI (Alec Radford, Rafal Jozefowicz, and Ilya Sutskever) made an interesting discovery while observing the values produced by one particular neuron of a language model they had trained. The value this single neuron took on, given an input sequence of text, predicted the sentiment of the full sequence with surprisingly high accuracy. Manually editing the value of this sentiment neuron gave them what they called a “direct dial to control the sentiment” of the text generated by the model. Within what seemed to be a chaotic bundle of self-organized weights, some structure had been identified. Equipped with even a partial understanding of the model, users were simultaneously granted an opportunity for interpretation, control, and compression of the system as a whole.
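For intuition, here is a minimal sketch, in PyTorch on a toy untrained character model rather than the original OpenAI multiplicative LSTM, of the two operations the Sentiment Neuron result rests on: reading one hidden unit as a sentiment score, and clamping that unit while generating. The model sizes, vocabulary, and NEURON_IDX below are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, HIDDEN, NEURON_IDX = 256, 64, 7   # hypothetical sizes and unit index

class TinyCharLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, tokens, state=None, clamp_value=None):
        h, state = self.lstm(self.embed(tokens), state)
        if clamp_value is not None:
            # The "direct dial": overwrite one unit of the LSTM output
            # before the readout head (a toy version of the intervention).
            h = h.clone()
            h[..., NEURON_IDX] = clamp_value
        return self.head(h), h, state

model = TinyCharLM().eval()   # untrained toy, so outputs are random noise

# 1) Read the neuron: its value after the last token serves as a sentiment score.
tokens = torch.randint(0, VOCAB_SIZE, (1, 32))   # stand-in for encoded text
with torch.no_grad():
    _, hidden, _ = model(tokens)
sentiment_score = hidden[0, -1, NEURON_IDX].item()
print(f"sentiment neuron activation: {sentiment_score:+.3f}")

# 2) Clamp the neuron while sampling to dial the generated sentiment up or down.
@torch.no_grad()
def generate(prefix, steps=20, clamp_value=None):
    out, state, inp = prefix.tolist(), None, prefix.unsqueeze(0)
    for _ in range(steps):
        logits, _, state = model(inp, state, clamp_value)
        nxt = torch.distributions.Categorical(logits=logits[0, -1]).sample()
        out.append(nxt.item())
        inp = nxt.view(1, 1)
    return out

positive = generate(tokens[0], clamp_value=+3.0)   # dial sentiment up
negative = generate(tokens[0], clamp_value=-3.0)   # dial sentiment down
```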
Research over the past few years has continued to pull back the curtain on the inner workings of deep neural networks. One overarching direction has remained clear: as with the study of many other complex systems (e.g., those in physics or economics), understanding the components and how those components interact with one another often yields insight into the behavior of the aggregate system. Despite the clarity of this goal, model interpretability has presented some interesting challenges on the path to broader understanding.
One difficulty in extending the insight from the Sentiment Neuron work, as Anthropic demonstrated in their Superposition work, is that not all concepts map directly to a single neuron. Instead, in many cases a single neuron is polysemantic, representing multiple semantic topics. An exciting early approach from researchers focused on finding those “Neurons in a Haystack” that were monosemantic, representing only one specific topic. Although this methodology worked in some cases, it fell short of being a general-purpose mechanism for interpretability and control.
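As a toy illustration of superposition (not Anthropic’s actual setup), the snippet below stores three hypothetical features in only two neurons; with more features than neurons, at least one neuron must respond to several features, i.e., be polysemantic. The features and their directions are made up for the example.

```python
import numpy as np

# Each row is the 2-neuron direction used to store one of three features.
feature_directions = np.array([
    [1.0, 0.0],               # feature A
    [-0.5, np.sqrt(3) / 2],   # feature B
    [-0.5, -np.sqrt(3) / 2],  # feature C
])

# Activations when exactly one feature is "on" with strength 1.
for name, direction in zip("ABC", feature_directions):
    print(f"feature {name} alone -> neuron activations {np.round(direction, 2)}")

# Reading it the other way around: neuron 0 fires for A, B, and C, so its value
# alone cannot tell us which concept is present -- it is polysemantic.
for n in range(2):
    responders = [name for name, d in zip("ABC", feature_directions) if abs(d[n]) > 1e-9]
    print(f"neuron {n} responds to features {responders}")
```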
Researchers including those at Anthropic decided to cast the problem in a new light – if we decompose a topic into a small group of “features” relating to multiple neurons instead of mapping a topic to one particular neuron then perhaps this relaxed problem could be solved. Anthropic’s Towards Monosemanticity work demonstrated this procedure on small transformer models by training a separate shallow neural network to learn these features. Within a year it was shown by Anthropic, OpenAI, and Deepmind that this approach or a similar one could be scaled to work on production models of sizes used by millions today. The potential to improve enterprise systems that rely on foundation models was clear as some features may relate to unsafe behaviors which could be dialed down. Other features could be activated that allowed the base model to acquire a specific personality like Golden Gate Claude, which attempts to steer conversations towards discussing the Golden Gate bridge whenever possible.
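A minimal sketch of this idea, assuming a simple sparse autoencoder in the spirit of Towards Monosemanticity: model activations are re-expressed as many sparse features, and a chosen feature can then be dialed up or down before being decoded back into the model. The dimensions, training details, and feature index below are illustrative assumptions, not any lab’s actual values.

```python
import torch
import torch.nn as nn

D_MODEL, N_FEATURES = 512, 4096   # hypothetical activation width / dictionary size

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=D_MODEL, n_features=N_FEATURES):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

# Train on activations captured from the base model (random stand-ins here);
# the L1 term pushes most features to zero so each one stays interpretable.
for _ in range(100):
    acts = torch.randn(64, D_MODEL)                        # batch of captured activations
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# "Feature steering": scale one learned feature (e.g. dial an unsafe behavior
# down to 0, or a persona feature up), then decode back to activation space.
@torch.no_grad()
def steer(activations, feature_idx, scale):
    _, feats = sae(activations)
    feats = feats.clone()
    feats[:, feature_idx] *= scale
    return sae.decoder(feats)   # steered activations to feed back into the model

steered = steer(torch.randn(1, D_MODEL), feature_idx=123, scale=0.0)   # hypothetical feature index
```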
Despite the promise of introducing foundation models across an enterprise software stack, studies have shown that 91% of enterprises don’t feel very prepared to do so responsibly, and 71% of senior IT leaders are concerned about generative AI safety risks. Some undesired outcomes of foundation model response generation can potentially be gated with guardrails, firewalls, and prompt-engineering toolkits, but the combination of interpretability and control presents an exciting opportunity for a new first line of defense. A second problem interpretability could help with is reducing model compute cost, which 63% of executives cite in a recent study as a top concern during adoption.
We were fortunate to meet Cyril Gorlla and Trevor Tuttle at last year’s PyTorch Conference. They shared with us details of the research they had been doing at UC San Diego, inspired by the path forged in model interpretability. Their newly formed company, CTGT, had started to build a toolkit to customize, train, and deploy AI models in a safe and cost-effective manner. The team has made significant progress in the time we’ve known them and has landed enterprise customers who rely on CTGT software to leverage foundation models within their organizations.
We’re excited to announce our lead investment in CTGT’s oversubscribed $7.2M seed round, which includes participation from General Catalyst, Y Combinator, Liquid 2, and notable angels including François Chollet (Google, creator of Keras), Michael Seibel (Y Combinator, co-founder of Twitch), Paul Graham (Y Combinator), Peter Wang (Anaconda), Wes McKinney (creator of Pandas), Kulveer Taggar (Zeus Living), and Taner Halicioglu (first full-time Facebook employee). For those looking to improve the safety and efficiency of LLMs within their enterprise, we encourage you to reach out.