Welcome to my personal notes!

ok now my % dead neurons curve is just buggin

so ugly

gonna let it keep cooking though, neither the mse loss nor the auxk loss has stalled out


going to set up wandb, i am sick and tired of tensorboard


i guess it is trending in the right direction though

July 15, 2024

not really sure what to do at this point

reconstruction loss stalls out after about a day, and the aux loss seems to do little to prevent dead neurons

i am pretty sure that the only difference between my implementation and openai's is that my threshold for dead neurons is much lower?

i am at 100k steps, where openai used 10M

although i am unsure if their metric was training steps or actual tokens, because I would actually be at 25.6M (batch size is 256)
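quick sanity check of the steps vs. tokens arithmetic. this assumes each item in a batch is one token's activation (so tokens = steps x batch size) and reads the 10M figure as tokens, which is exactly the part that is still uncertain:

```python
# steps vs. tokens, using the numbers from the notes above
BATCH_SIZE = 256          # activations (tokens) per training step
STEPS_SO_FAR = 100_000    # current training step count
TARGET = 10_000_000       # the 10M figure, read here as tokens (assumption)

tokens_seen = STEPS_SO_FAR * BATCH_SIZE
steps_for_target = TARGET // BATCH_SIZE

print(f"tokens seen at {STEPS_SO_FAR:,} steps: {tokens_seen:,}")
print(f"steps needed for {TARGET:,} tokens: {steps_for_target:,}")
```

if the 10M is tokens, the threshold would be hit well before 100k steps.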


holup, number of dead neurons is decreasing???

maybe small changes yesterday had an effect, too soon to call though


yeah didn't work. now retrying to actually be 10M TOKENS, which means only ~39k instead of 100k

this might be the cause of why, once auxk kicks in, there are already so many dead neurons (i am starting auxk too late, as opposed to too early)
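rough sketch of a dead-neuron counter that measures the threshold in tokens instead of steps, so the steps/tokens mixup can't happen. the class name and the tiny numbers are made up for illustration, this is not the code from the paper or from my training loop:

```python
class DeadLatentTracker:
    """Counts how many tokens have passed since each latent last fired."""

    def __init__(self, n_latents, dead_after_tokens):
        self.tokens_since_fired = [0] * n_latents
        self.dead_after_tokens = dead_after_tokens  # threshold in TOKENS, not steps

    def update(self, batch_activations):
        """batch_activations: list of rows, one row of latent values per token."""
        n_tokens = len(batch_activations)
        for j in range(len(self.tokens_since_fired)):
            fired = any(row[j] > 0 for row in batch_activations)
            if fired:
                # approximation: reset to 0 no matter where in the batch it fired
                self.tokens_since_fired[j] = 0
            else:
                self.tokens_since_fired[j] += n_tokens

    def dead_mask(self):
        return [t >= self.dead_after_tokens for t in self.tokens_since_fired]

    def dead_fraction(self):
        mask = self.dead_mask()
        return sum(mask) / len(mask)


tracker = DeadLatentTracker(n_latents=4, dead_after_tokens=4)
batch = [
    [0.0, 1.2, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.0],
]
tracker.update(batch)   # 2 tokens seen
tracker.update(batch)   # 4 tokens seen
print(tracker.dead_mask())      # [True, False, False, True]
print(tracker.dead_fraction())  # 0.5
```

the mask is what an auxiliary loss would use to pick which latents to revive, so if the threshold is too high the mask stays empty for a long time.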

if this doesn't work, i wrote up an email to send to paper author as a last ditch effort

July 12, 2024

model looks pretty good now, very few dead neurons and activation frequency is very low (sparsity!)

will need to write new dataloader to look at features, since my current one doesnt save the actual tokens


actually there may be a lot of dead neurons

also, reconstruction actually isn't very good, after ~16 tokens it becomes terrible


alright i've cleaned everything up, if model doesn't work now idk what im gonna do

just gonna let it train all through tomorrow too

July 11, 2024

now model won't converge

reconstruction is really terrible:

  • base model: "Cars, also known as automobiles, are wheeled vehicles used for transportation. They are a common means of transport for..."
  • using sae reconstruction: "Cars, also known in and' the the, cars are cars. Cars are a car which cars cars cars cars cars cars..."
  • wait nevermind forgot to get rid of topk

  • REAL sae reconstruction: "Cars, or automobiles, are vehicles primarily designed for transporting people and goods, and they are a major means of..."

  • now i just need to make sure features are actually sparse (not sure how they wouldn't be)


    features are not even close to sparse, i think topk activation does not work correctly😭

    back to training😔

    looking back, it was strange that there were 0 dead neurons

    July 10, 2024

    turns out i've been shuffling the wrong dimension of my data (along the model dim instead of the batch dim)
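toy version of the two shuffles, in plain python with lists standing in for a (batch, model_dim) tensor. function names are hypothetical and this is only a sketch of the bug, not the actual dataloader:

```python
import random

def shuffle_batch_dim(activations, seed=0):
    """Reorder whole rows (one row = one activation vector); values inside a row stay aligned."""
    rng = random.Random(seed)
    rows = [list(r) for r in activations]
    rng.shuffle(rows)
    return rows

def shuffle_model_dim(activations, seed=0):
    """The bug: permute the features inside every row, scrambling what each dimension means."""
    rng = random.Random(seed)
    perm = list(range(len(activations[0])))
    rng.shuffle(perm)
    return [[row[i] for i in perm] for row in activations]

acts = [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]]
good = shuffle_batch_dim(acts)   # same rows, different order
bad = shuffle_model_dim(acts)    # same row order, columns permuted
print(good)
print(bad)
```

with pytorch tensors the equivalent would be indexing dim 0 vs. dim 1 with a `torch.randperm` permutation. the wrong version leaves the batch order untouched, so consecutive activations still come from the same example.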


    i think ive implemented auxk loss and topk activations correctly, but for auxk it is hard to know since neurons generally dont die till later in training

    so i basically have to wait for a while to see if it works or not
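minimal sketch of what the topk activation is supposed to do, plus the check that would have caught the non-sparse features: keep the k largest pre-activations per token, zero everything else, then count nonzeros. plain python on one row, helper names are made up, auxk is not covered here:

```python
def topk_activation(pre_acts, k):
    """Keep the k largest values in one row of pre-activations, zero the rest."""
    if k <= 0:
        return [0.0] * len(pre_acts)
    # indices of the k largest entries (ties go to the lower index)
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]

def l0(row):
    """Number of nonzero latents for one token."""
    return sum(1 for v in row if v != 0.0)

pre = [0.1, 2.0, -0.5, 1.5, 0.3]
out = topk_activation(pre, k=2)
print(out)      # [0.0, 2.0, 0.0, 1.5, 0.0]
print(l0(out))  # 2
```

if `l0` ever comes back larger than k for any token, the topk is not being applied where you think it is.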

    loss is definitely smoother after correcting the data shuffling


    loss curve still has weird artifacts

    i think it still has to do with shuffling, as some text examples are really long, so even with shuffling lots of activations might contain similar features?

    every large uptick in loss coincides with new set of examples


    changed it to use 1/5 of each example, so shuffle should be noticeably better

    ideally, each activation would be from a totally different example at a totally different time step, but that would require either a ton of time spent doing ~inference on the base model or an insane amount of storage, neither of which i have

    July 9, 2024

    https://youtube.com/playlist?list=PLJ66BAXN6D8H_gRQJGjmbnS5qCWoxJNfe&si=XqBK6P6VRr9iJgFN

    today's paper:

    https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

    83% of my neurons are dead😔

    i guess the new loss function was not enough

    https://arxiv.org/abs/2406.04093v1

    wish i would have seen this paper 2 days ago


    openai uses same loss function as original (towards monosemanticity) anthropic paper, but new anthropic paper uses new one (which i implemented and resulted in hella dead neurons)

    there must be something i am missing re: new anthropic method, since oai uses extra stuff (only uses topK activations, auxiliary loss)

    July 8, 2024

    i think i found the memory problem: the optimizer was about 8gb on the gpu


    new personal site is up


    next project after interpretability stuff will either be agents in video games or some kind of really quick diffusion model that is interactive


    i need a better way to organize papers i want to read, maybe a page on my site would work

    July 7, 2024

    dataloader is super convoluted, but seems to be working so far

    something is wrong though, my loss curve looks like a cosine function

    model will probably have to train for a couple days... hopefully i did everything correct


    i forgot that deleting files just puts them in trash, it doesn't actually delete them

    i have 1.3TB of deleted model activations in my trash

    July 6, 2024

    model is done, now working on efficient dataloader, which is much more of a challenge than i wouldve thought

    July 3, 2024

    the smallest SAE anthropic trained for golden gate claude had an internal dim of >1M

    that is 256x the activation dim (for my model); the toy sae i trained was only 32x larger

    may have to bring out the big guns later (cloud gpu)

    hooray! they said no resampling was needed when they use the new sparsity penalty!

    July 2, 2024

    re: scaling up interp

    i can now get the activations of layer N of mistral 7b on some tokens, now i just need a smart way of doing this efficiently while training SAE


    https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

    will definitely have to be more disciplined re: training of SAE to make sure i get rid of dead neurons

    internal dim of mistral7b is 4096, which is still not super big, so THEORETICALLY model should not take too long to train

    long term goal for this project is to train model for each layer (32 in total) and release some kind of interactive site where you can play with activating different features

    goal for this week is just to get a single layer trained

    good name for this project is "Golden Gate 7b"

    July 1, 2024

    taking a break from arc-agi today, gonna get mistral-7b + training data set up to scale up sparse autoencoder

    am having a hard time finding a pure pytorch implementation of mistral-7b (need to have fine control over individual layers so i can access activations)

    implementing it myself might be the move

    June 30, 2024

    finished basic data augmentation + tokenizer, will try some experiments to see if these improve performance


    blog post is done, some time this week i'll ship new site and start on scaling interpretability stuff to bigger open source models

    June 27, 2024

    not getting anywhere with mcts, predicting whether a solution is right in a single step is just as hard as base problem, and determining whether a solution is a bit better than another is hard

    maybe will return to it at some point

    i definitely still like the idea of training on specific example at inference time though


    ok with new strategy, am getting 60% of pixels right (for the first task, will move to others when i start seeing better results)

    this is pretty terrible considering that random guessing would do only slightly worse

    gives me a baseline though


    i think something that will probably have an outsized impact is how im doing tokenization/preparing inputs

    June 26, 2024

    ok website is pretty close to being done, as is the blog post

    time to work on arc


    current method not really working

    will continue new strategy tomorrow

    June 25, 2024

    working on ARC

    my model is buggin fr

    loss is going to the moon 😭

    architecture is way too complicated

    maybe some kind of siamese network that i partially train at inference (one side is input, other is output)

    once trained on examples, then search for output that makes test input work?


    model can easily distinguish between random noise and actual answers (very easy)

    while training, need more sophisticated way to generate incorrect answers (start with correct answer and apply random stuff)
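one way the "start with correct answer and apply random stuff" idea could look: copy the correct grid and recolor a few cells. the function name and the single corruption type are assumptions for illustration, `n_colors=10` matches the 0-9 color values used in arc grids:

```python
import random

def corrupt_grid(grid, n_changes, n_colors=10, rng=None):
    """Return a copy of `grid` with `n_changes` distinct cells set to a different color."""
    rng = rng or random.Random()
    h, w = len(grid), len(grid[0])
    out = [list(row) for row in grid]
    cells = [(r, c) for r in range(h) for c in range(w)]
    for r, c in rng.sample(cells, n_changes):
        # pick any color except the current one so the cell really changes
        choices = [v for v in range(n_colors) if v != grid[r][c]]
        out[r][c] = rng.choice(choices)
    return out

correct = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
negative = corrupt_grid(correct, n_changes=2, rng=random.Random(0))
n_diff = sum(a != b for ra, rb in zip(correct, negative) for a, b in zip(ra, rb))
print(negative)
print(n_diff)  # 2
```

small `n_changes` gives near-miss negatives, which are much harder to tell apart from the real answer than random noise.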

    June 24, 2024

    re: arc

    i'd like to use this as an excuse to try out combining mcts with normal deep learning stuff, so first step is probably just pure mcts

    also starting out with the smaller puzzles (3x3) might help

    mcts wont work alone though, because there is no way to tell if current leaf is the final solution, so you need some kind of model that determines if a solution is correct (might be just as hard as normal problem)

    you need a model whose weights update with each example, and then can be given the test state along with a proposed solution resulting in a probability that it is correct

    is this what a "liquid" neural net is

    i suppose that for each task you could just optimize (normal gradient descent) over your examples, but there is no way it wouldn't overfit with only ~3 examples

    might work if you use a tiny model, but that wouldn't have sufficient complexity for harder tasks

    i think liquid neural nets could be the move

    the paper is pretty dense tho

    https://arxiv.org/pdf/2006.04439

    June 23, 2024

    gonna work on arc challenge before i try scaling up SAE to actual open source models (likely on 7b param models, though we'll see if i have the necessary compute)


    new site is probably about 75% done, but i'd like to finish the blog post before i ship

    June 22, 2024

    i need to learn einsum

    June 21, 2024

    letting model train way after loss stopped improving may have worked, distribution seems to look better


    found interpretable features!!!!

    about 1/3 of them are totally dead, but the first one i looked at seems to be the end of a sentence followed by a new sentence that begins with "The"

    the way i am looking at them is still super crude, but this is really promising

    pretty much all of the features i have looked at so far correspond to single common words like "during", "of", "to"

    nevermind, just found one that seems to be about passing rules:

    > the US and Europe,__ signing__ a deal with Pharmaceutical

    > the government__ signed__ a peace agreement with

    > this month, the Senate__ launched__ its best-known

    > Many women were reluctant to__ file__ complaints against their

    the token with the underscores around it is the token the feature fired on most
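sketch of how that underscore formatting can be produced from a token list and per-token activations of one feature. the tokens (with leading spaces) and the activation values below are made up for illustration:

```python
def highlight_max_token(tokens, activations, window=4):
    """Return a context string with the highest-activation token wrapped in double underscores."""
    peak = max(range(len(tokens)), key=lambda i: activations[i])
    lo = max(0, peak - window)
    hi = min(len(tokens), peak + window + 1)
    parts = []
    for i in range(lo, hi):
        parts.append(f"__{tokens[i]}__" if i == peak else tokens[i])
    return "".join(parts)

tokens = ["the", " government", " signed", " a", " peace", " agreement", " with"]
acts = [0.0, 0.1, 3.2, 0.0, 0.4, 0.2, 0.0]   # made-up activations for one feature
print(highlight_max_token(tokens, acts))
# the government__ signed__ a peace agreement with
```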


    reasonable summary would be that most features correspond to specific words, though some are more general and will fire for any synonym, which implies generalization!

    i wouldn't expect to see many features for relationships more complex than single words, since the output of the actual model is not super coherent


    based on some rough estimations, it seems like about 1/3 of the features are "interpretable", 1/3 are dead, and the rest are still kinda in superposition (they activate really often and on a bunch of seemingly unrelated tokens)

    June 20, 2024

    need to take a break from interp model (still getting weird artifacts in feature distributions), will work on website redesign

    small chance that autoencoder isnt working bc it hasnt seen enough tokens, which is scary because if it is not true it will mean i have wasted like an entire day waiting for it to train


    hilbert curve to make arc agi 1d so you can put it in temporal format

    i didnt think of that its just a really cool idea
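for reference, a sketch of flattening a grid along a hilbert curve, using the standard index-to-coordinate conversion. the curve needs a power-of-two side length, so padding smaller grids with -1 is an assumption made here, not part of the original idea:

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid (n a power of two)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x   # rotate the quadrant
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def flatten_hilbert(grid, pad=-1):
    """Read a 2D grid in Hilbert-curve order, padding up to the next power-of-two square."""
    h, w = len(grid), len(grid[0])
    n = 1
    while n < max(h, w):
        n *= 2
    seq = []
    for d in range(n * n):
        x, y = hilbert_d2xy(n, d)
        seq.append(grid[y][x] if y < h and x < w else pad)
    return seq

print(flatten_hilbert([[1, 2], [3, 4]]))  # [1, 3, 4, 2]
```

the nice property is that neighbors in the 1d sequence are always neighbors in the grid, which row-major flattening breaks at every row end.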

    June 19, 2024

    idk man, the distribution of activations is all goofy

    this autoencoder way too sparse

    holup i might be goated

    June 18, 2024

    is there anything better than waking up to a beautiful loss curve whose model has been training overnight

    loss is still higher than i expected, though it makes sense since it is a single, pretty small layer

    i am now wondering if my dataset is too uniform (the paper found features for other languages or base64, but i think my dataset is basically wikipedia-type tokens)

    guess we'll see


    some example output:

    > It is only recently that he was compelled to return to Australia to prosper from self-government to wholesome and to cultures of central Australia.

    > In Fremont County is a lush green town named according to an article published by Smithsonian magazine.

    obviously doesn't make sense but there are still connections being made (*articles* are published by *magazines*)

    also, there are sometimes other languages in the output, so those features will actually be there

    time to start on the autoencoder!


    autoencoder is being difficult, like 80% of the neurons are dead :(

    trying to just reinitialize the weights for those every so often, but its lowkey buggin

    June 17, 2024

    re: training the single layer transformer, i could just use a pretrained one (like what the open source replication did), but i waited for like 5 hours yesterday to download a huge dataset, so i'd like to do it myself


    ok should have fully trained model by tomorrow


    https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt

    ok nevermind this isn’t actually doing reasoning, just trying a bunch of solutions to see if it works


    have basic training loop working, for model of this size i should probably add some more sophisticated stuff though (learning rate schedule, proper logging/val testing, early stopping)

    i think this might be the first time training a model has worked first try though

    June 16, 2024

    https://transformer-circuits.pub/2023/monosemantic-features#phenomenology-fsa

    the html open/close tag circuit is so cool, i have always wondered how models keep track of syntax stuff like this when writing code


    ok first step of replication is just training single layer transformer

    definitely will be smaller than what was used in the paper, but i should hopefully still get some cool results


    https://arxiv.org/pdf/2406.07394

    need to be reading more MCTS stuff, my knowledge pretty much ends at what alphago used

    June 15, 2024

    sparse autoencoders could be the move


    ok new project is recreating Towards Monosemanticity results, then eventually try to do the same for larger open source models (larger meaning ~7b params, though we'll see if i have enough compute even for that)

    https://gwern.net/forking-path

    June 14, 2024

    ok remade the first experiment, definitely helped make everything more concrete


    on a tiny model (single layer autoencoder), you can see that as sparsity increases, more features can be represented

    more sparsity = more likely to only see a single feature per example

    this is because models use polysemanticity and superposition (when a neuron encodes more than a single feature)

    with a lot of sparsity, each feature is less and less orthogonal to others, hence what looks like noise outside of the diagonal


    not sure if i will reimplement later parts of the paper, it gets kinda hairy and not super applicable to big models

    but the above is pretty cool and shows why interpretability is so hard (lots of sparsity => superposition => messy neurons that encode lots of different things)


    for the rest of today i want to finish this paper and then start on the toy monosemanticity one


    chollet episode of dwarkesh pod has completely changed my outlook on the future of LLMs

    LLMs are just memory, and we do not yet have logical reasoning

    the fact that models can’t pass the ARC benchmark is very clear evidence of this, and i had never heard of it

    June 13, 2024

    papers (especially ones with less math notation) on the kindle is definitely the move

    ok gonna try to recreate some of the visualizations from the "toy models of superposition" paper

    June 12, 2024

    a paper a day


    today's paper: Gradient-based learning applied to document recognition (original CNN paper)

    figure i should start out with things i am already familiar with to get better at reading papers in general


    i am pretty sure this is from @varepsilon ideas for projects, but a command line tool that gives a public link to local images would be fun to build

    would be pretty easy too


    https://transformer-circuits.pub/2023/monosemantic-features

    mech interp is so cool

    https://transformer-circuits.pub/2022/toy_model/index.html

    next project will be something to do with interpretability

    once i finish reading some papers i will hopefully have a better idea of what it'll be

    command line tool was way easier than i thought, literally just an imgur api wrapper

    something more robust would be better, but i probably wont even put it on github, let alone putting it on a package manager

    June 3, 2024

    https://rubiks.tylercosgrove.com/

    LGTM

    runs slow but i am ready to work on something new

    i think updating my personal website would be good, i am sick of it

    June 1, 2024

    checking if a move undoes previous one (plus some other little checks) reduces total moves checked by more than 10x
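sketch of this kind of pruning on the full 18-move set. the notes don't spell out what the "other little checks" are, so the rules here are a common choice and an assumption: never turn the same face twice in a row, and only allow one order for opposite-face pairs since they commute:

```python
FACES = "UDLRFB"
SUFFIXES = ["", "'", "2"]
MOVES = [f + s for f in FACES for s in SUFFIXES]   # 18 face turns
OPPOSITE = {"U": "D", "D": "U", "L": "R", "R": "L", "F": "B", "B": "F"}

def allowed_after(prev, move):
    """Return False for moves that are redundant right after `prev`."""
    if prev is None:
        return True
    pf, mf = prev[0], move[0]
    if pf == mf:
        return False   # same face twice either cancels or collapses into one move
    if OPPOSITE[pf] == mf and FACES.index(mf) < FACES.index(pf):
        return False   # opposite faces commute, keep only one canonical order
    return True

def count_sequences(depth, prune):
    """Count move sequences of a given length, with or without pruning."""
    def rec(prev, remaining):
        if remaining == 0:
            return 1
        total = 0
        for m in MOVES:
            if prune and not allowed_after(prev, m):
                continue
            total += rec(m, remaining - 1)
        return total
    return rec(None, depth)

for depth in (2, 3, 4):
    print(depth, count_sequences(depth, prune=False), count_sequences(depth, prune=True))
```

the savings compound with depth, since every level of the search drops the redundant branches again.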

    full algo is really quick now

    maybe in the future i will go back and implement the loop to find more optimal paths, but i would rather have it run really quick than save a couple moves

    max # of moves i've seen is 25, but theoretically it could produce a 30 move solve

    30 should be the max though

    May 31, 2024


    phase 2 done

    phase 2 moves can get pretty long, but i can work on that

    algo is basically done!

    i just need to go back and forth between phase 1 and 2 to get overall move count lower


    not sure if i even need to do that though, move counts are in the low twenties, which is pretty good

    going to integrate it into the opencv part now