Machine Learning

1
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/MadEyeXZ on 2025-02-23 08:30:10+00:00.


screenshot

Try it here:

2
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Successful-Western27 on 2025-02-22 07:02:41+00:00.


A new evaluation benchmark tests language models across 285 graduate-level disciplines using an iterative human-AI collaborative approach to generate and validate questions. The methodology combines expert review with model-assisted filtering to ensure high-quality, discipline-appropriate assessment.

Key technical points:

  • Uses a two-stage question generation process: initial AI generation followed by expert review
  • Implements collaborative filtering where both human experts and LLMs help identify and remove problematic questions (sketched after this list)
  • Covers disciplines from traditional academia to specialized industrial fields
  • Tests both factual knowledge and reasoning capabilities
  • Evaluated on multiple leading LLMs including GPT-4, Claude 2, and DeepSeek
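
As an illustration only (not the paper's actual pipeline), the generation-plus-filtering loop described above can be sketched like this; all three helper functions are hypothetical stand-ins:

```python
# Illustration only (not the paper's pipeline): a two-stage filter where an
# LLM flags obviously bad questions and experts approve the rest. All three
# helper functions are hypothetical stand-ins.
def generate_candidates(discipline, n):
    return [f"{discipline} question {i}" for i in range(n)]  # stand-in for LLM generation

def llm_flags_problem(question):
    return False  # stand-in for model-assisted filtering

def expert_approves(question):
    return True   # stand-in for expert review

def build_benchmark(disciplines, per_discipline=10):
    validated = []
    for discipline in disciplines:
        for q in generate_candidates(discipline, per_discipline):
            if llm_flags_problem(q):    # stage 1: drop questions the model flags
                continue
            if expert_approves(q):      # stage 2: keep only expert-approved ones
                validated.append(q)
    return validated

print(len(build_benchmark(["Geophysics", "Naval Architecture"])))  # 20
```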

Results:

  • Best performance: DeepSeek-R1 at 61.82% accuracy
  • Significant variance in performance across different disciplines
  • 80+ expert annotators involved in validation
  • Generated dataset of 2,855 validated questions

I think this benchmark addresses a critical gap in LLM evaluation by going beyond common academic subjects. The methodology of combining human expertise with AI assistance for question validation could be valuable for developing future evaluation datasets.

I think the relatively modest performance (62%) on graduate-level questions across diverse fields suggests current LLMs still have significant room for improvement in specialized domains. This could influence how we approach model training and evaluation for domain-specific applications.

TLDR: New benchmark tests LLMs across 285 graduate disciplines using human-AI collaborative question generation. Best model achieved 62% accuracy, revealing gaps in specialized knowledge.

Full summary is here. Paper here.

3
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Factemius on 2025-02-21 22:13:04+00:00.


Hello!

I'm considering finetuning Whisper according to this guide:

I have 24+8 GB of VRAM and 64 GB of RAM

The documentation is here, but I'm struggling to find reports from people who have attempted the fine-tuning

What I'm looking for is how much time and resources I should expect it to take, along with some tips and tricks before I begin
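
For orientation, the skeleton from that kind of guide looks roughly like the sketch below. This is a sketch only: the dataset preparation and padding collator come from the guide and are stubbed out here, and the hyperparameters are illustrative, not recommendations. Time per epoch will depend mostly on model size, batch size, and dataset length.

```python
# Rough skeleton of Whisper fine-tuning with Hugging Face transformers
# (model size and hyperparameters are illustrative, not recommendations).
import torch
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

model_name = "openai/whisper-small"      # pick a size that fits your VRAM
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",
    per_device_train_batch_size=8,       # lower this if you hit OOM
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    fp16=torch.cuda.is_available(),
    predict_with_generate=True,
)

# The dataset (log-mel "input_features" + tokenized "labels") and the padding
# data collator are prepared exactly as in the guide; stubbed out here.
train_ds = eval_ds = None
data_collator = None

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=data_collator,
)
# trainer.train()   # launch once the dataset and collator are in place
```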

Thanks in advance!

4
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/ThienPro123 on 2025-02-22 17:15:49+00:00.

5
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/CH1997H on 2025-02-21 12:23:20+00:00.


Grok 3 was supposedly trained on 100,000 H100 GPUs, roughly 10x more than models like the GPT-4 series and Claude 3.5 Sonnet

Yet they're about equal in abilities. Grok 3 isn't AGI or ASI like we hoped. In 2023 and 2024 OpenAI kept saying that they can just keep scaling the pre-training more and more, and the models just magically keep getting smarter (the "scaling laws" where the chart just says "line goes up")

Now all the focus is on reasoning, and suddenly OpenAI and everybody else have become very quiet about scaling

It looks very suspicious to be honest. Instead of making bigger and bigger models like in 2020-2024, they're now trying to keep them small while focusing on other things. Claude 3.5 Opus got quietly deleted from the Anthropic blog, with no explanation. Something is wrong and they're trying to hide it

6
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Ambitious_Anybody855 on 2025-02-22 00:30:49+00:00.


The best way to decensor a DeepSeek model? Don’t try to decensor it.

The OpenThinker models were fine-tuned on OpenThoughts-114k, a dataset focused on reasoning tasks like math, coding, and graduate-level Q&A, with no political content. Despite starting from censored base models (Qwen), the resulting OpenThinker-7B and OpenThinker-32B came out decensored without any explicit intervention. Unlike Perplexity's approach, no custom fine-tuning was applied to remove censorship, yet the outputs are uncensored.

It challenges assumptions about model safety and opens exciting new research directions. AI game is so on

7
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Ready_Plastic1737 on 2025-02-21 17:30:22+00:00.


I was given a problem statement and data to go along with it. My initial intuition was: "What features are most important in this dataset, and what initial relationships can I reveal?"

I proposed t-SNE, PCA, or UMAP to observe preliminary relationships worth exploring, but was immediately shut down because "reducing dimensions means losing information."

Which I know is true, but... _____________

Can some of you fill in the ___________? What would you have said?
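
One common counter is that the "lost information" is measurable: before drawing any conclusions from a 2-D view, you can check how much variance the projection actually keeps. A minimal sketch with scikit-learn (random data stands in for the real features):

```python
# Sketch: check how much variance a 2-D PCA projection keeps before using it
# for exploratory plots (random data stands in for the real features).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))               # placeholder for the real dataset

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)

kept = pca.explained_variance_ratio_.sum()   # fraction of variance retained
print(f"2 components keep {kept:.1%} of the variance")
X_2d = pca.transform(X_scaled)               # coordinates for a quick scatter plot
```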

8
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Rybolos on 2025-02-21 17:46:14+00:00.

9
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/nihaomundo123 on 2025-02-20 21:59:39+00:00.


Hi all,

21M deciding whether or not to specialize in theoretical ML for their math PhD. Specifically, I am interested in

i) trying to understand curious phenomena in neural networks and transformers, such as the neural tangent kernel and the impact of pre-training & multimodal training in generative AI (papers like:  and ).

ii) but NOT interested in papers focusing on improving empirical performance, like the original dropout and batch normalization papers.

I want to work on something with the potential for deep impact during my PhD, yet still theoretical. When trying to find out whether the understanding-based questions in category i) fit this description, however, I could not find much on the web...

If anyone has any specific examples of papers whose main focus was to understand some phenomena, and that ended up revolutionizing things for practitioners, would appreciate it :)

Sincerely,

nihaomundo123

10
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Intelligent-Life9355 on 2025-02-20 08:45:45+00:00.


I am surprised! Even a very simple reinforcement learning setup, without the complexities of RL algorithms like PPO, TRPO, GRPO, etc., can lead to emergent results at limited compute. I could literally recreate emergent behavior in a 3B model for under $10. The design choices were made keeping in mind how RL in large language model settings differs from traditional RL problems such as robotics or Atari games in terms of state space and action space. The idea was to start really simple via a modified RL algorithm, ReinforceLite. The results were quite surprising; it's almost as if even a 3B model is inherently capable of doing amazing things if agency is instilled in it the right way.

11
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/meltingwaxcandle on 2025-02-20 21:22:44+00:00.


LLM hallucinations are a major challenge, but what if we could predict when they happen? Nature had a great publication on semantic entropy, but I haven't seen many practical guides on detecting LLM hallucinations and production patterns for LLMs.

Sharing a blog about the approach and a mini experiment on detecting LLM hallucinations. BLOG LINK IS HERE

  1. Sequence log-probabilities provide a free, effective way to flag unreliable outputs (a rough proxy for LLM confidence); see the sketch below.
  2. High-confidence responses were nearly twice as accurate as low-confidence ones (76% vs 45%).
  3. Using this approach, we can automatically filter poor responses, route them to human review, or trigger iterative RAG pipelines.
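
For illustration, a minimal sketch of the sequence log-probability signal with Hugging Face transformers (the model, prompt, and the idea of thresholding the average are placeholders; the blog's exact setup may differ):

```python
# Minimal sketch: average token log-probability of a generated answer as a
# confidence signal (model, prompt, and threshold choice are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: What is the capital of France?\nA:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
    )

# Per-token log-probs of the generated continuation, averaged into one score.
token_logprobs = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
seq_logprob = token_logprobs.mean().item()

answer = tok.decode(out.sequences[0][inputs["input_ids"].shape[1]:])
print(answer)
print(f"avg token log-prob: {seq_logprob:.3f}")  # low values -> treat as unreliable
```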

Love that information theory finds its way into practical ML yet again!

Bonus: precision recall curve for an LLM.

12
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/sgt102 on 2025-02-20 13:44:05+00:00.


Hi,

I've estimated the cost/performance of DeepSeek 671B like this:

Hugging Face's open DeepSeek blog reported config & performance: 32 H100s at 800 tokens/s

1 million tokens = 1,250 s ≈ 21 minutes

69.12 million tokens per day

Cost to rent 32 H100s per month: ~$80,000

Cost per million tokens = $37.33, i.e. $80,000 / 31 days / 69.12M tokens (recomputed in the snippet below)
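
A quick recomputation of the arithmetic above under the same assumptions (32 H100s, 800 tokens/s aggregate throughput, ~$80,000/month rental, 100% utilisation):

```python
# Recomputing the numbers above (32 H100s, 800 tokens/s aggregate,
# ~$80,000/month rental, 100% utilisation).
tokens_per_second = 800
seconds_per_day = 24 * 60 * 60
tokens_per_day = tokens_per_second * seconds_per_day          # 69,120,000

monthly_rent_usd = 80_000
days_per_month = 31
cost_per_day = monthly_rent_usd / days_per_month              # ~$2,580.65

cost_per_million_tokens = cost_per_day / (tokens_per_day / 1_000_000)
print(f"tokens/day: {tokens_per_day:,}")
print(f"cost per 1M tokens: ${cost_per_million_tokens:.2f}")  # ~$37.33
```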

I know that this is very optimistic (100% utilisation, no support etc.) but does the arithmetic make sense and does it pass the sniff test do you think? Or have I got something significantly wrong?

I guess this is 1000 times more expensive than an API served model like Gemini, and this gap has made me wonder if I am being silly

13
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/hiskuu on 2025-02-20 09:41:38+00:00.


Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.

Interesting paper on improving attention during training and inference in LLMs by Deepseek.

Arxiv link: [2502.11089] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

14
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Successful-Western27 on 2025-02-20 07:24:47+00:00.


The key contribution here is modeling language generation as a continuous diffusion process on a statistical manifold rather than using discrete token-based diffusion. This allows for smoother transitions between language states and more efficient generation.

Main technical points:

  • Uses Riemannian geometry to create a continuous manifold of probability distributions over tokens (see the note after this list)
  • Implements specialized neural architecture that learns to navigate this manifold space
  • Employs controlled diffusion paths for more precise generation
  • Achieves significant speedup in sampling (2-3x faster than discrete baseline)
  • Reports improved perplexity scores across multiple language benchmarks
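
A quick note on the terminology, since "statistical manifold" carries a precise meaning (whether the paper uses exactly this metric isn't stated in the post): the Riemannian structure usually placed on a family of distributions p_theta is the Fisher information metric,

$$
g_{ij}(\theta) \;=\; \mathbb{E}_{x \sim p_\theta}\!\left[ \frac{\partial \log p_\theta(x)}{\partial \theta_i} \, \frac{\partial \log p_\theta(x)}{\partial \theta_j} \right],
$$

under which distances measure how statistically distinguishable nearby token distributions are; that is what makes "smooth transitions between language states" a well-defined notion.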

Results on standard benchmarks:

  • WikiText-103: 16.8 perplexity (vs 18.2 baseline)
  • C4: 14.9 perplexity (vs 15.8 baseline)
  • Convergence in ~500 steps vs ~1000 for discrete models
  • Memory usage reduced by approximately 30%

I think this approach could meaningfully impact language model development by providing a more mathematically elegant way to handle text generation. The continuous nature better matches how language meaning actually flows, potentially leading to more natural outputs. The efficiency gains are particularly interesting for practical applications.

I think the main challenges ahead are:

  • Scaling to larger models while maintaining the manifold structure
  • Handling very long sequences effectively
  • Bridging theory and implementation for production systems

TLDR: Novel continuous diffusion approach for language modeling using statistical manifolds. Shows improved perplexity and generation speed vs discrete models. Promising direction for more efficient and natural language generation.

Full summary is here. Paper here.

15
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Excellent_Delay_3701 on 2025-02-20 05:07:05+00:00.


It translates PyTorch code into CUDA kernels.

Here are the steps:

Stages 1 and 2 (Conversion and Translation): The AI CUDA Engineer first translates PyTorch code into functioning CUDA kernels. We already observe initial runtime improvements without explicitly targeting them.

Stage 3 (Evolutionary Optimization):  Inspired by biological evolution, our framework utilizes evolutionary optimization (‘survival of the fittest’) to ensure only the best CUDA kernels are produced. Furthermore, we introduce a novel kernel crossover prompting strategy to combine multiple optimized kernels in a complementary fashion.

Stage 4 (Innovation Archive): Just as cultural evolution shaped human intelligence with know-how passed down from our ancestors through millennia of civilization, the AI CUDA Engineer takes advantage of what it has learned from its past innovations and discoveries, building an Innovation Archive from the ancestry of known high-performing CUDA kernels and using these stepping stones to achieve further translation and performance gains.

16
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/jsonathan on 2025-02-20 00:17:48+00:00.


RAG is suspiciously inelegant. Something about using traditional IR techniques to fetch context for a model feels... early-stage. It reminds me of how Netflix had to mail DVDs before the internet was good enough for streaming.

I just can’t imagine LLMs working with databases this way in the future. Why not do retrieval during inference, instead of before? E.g. if the database was embedded directly in the KV cache, then retrieval could be learned via gradient descent just like everything else. This at least seems more elegant to me than using (low-precision) embedding search to gather and stuff chunks of context into a prompt.

And FWIW I don’t think long context models are the future, either. There’s the lost-in-the-middle effect, and the risk of context pollution, where irrelevant context will degrade performance even if all the correct context is also present. Reasoning performance also degrades as more context is added.

Regardless of what the future looks like, my sense is that RAG will become obsolete in a few years. What do y'all think?

EDIT: DeepMind's RETRO and Self-RAG seem relevant.

17
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/jacobfa on 2025-02-19 14:51:24+00:00.


I show that diffusion kernels capture global dependencies and that a simple diffusion kernel with a recurrent structure outperforms transformers with fewer parameters and FLOPs.

18
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/pranftw on 2025-02-19 14:36:30+00:00.


Launching a fun side project called PapersTok () to preview AI-related arXiv papers with a TikTok-like experience.

In the current fast-paced world of AI research, where hundreds of papers are put up on arXiv daily, keeping up with the latest developments presents significant challenges. One of them is the difficulty of navigating the arXiv web interface, where new tabs have to be constantly opened and closed just to skim a title and abstract. What if there were a much simpler and more fun way to do just that?

Inspired by WikiTok, I built PapersTok to scroll through arXiv submissions related to AI. It has LaTeX support to render math equations. It also provides the ability to bookmark papers you find interesting. I'm planning to add more features in the coming days to enhance the experience of skimming through papers.

I'd love for the community to highlight the challenges they currently face that this tool could alleviate. Your feedback and comments are much appreciated. Feel free to DM me, or reach out on X or here on Reddit.

Screenshots

19
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/pseud0nym on 2025-02-19 02:02:05+00:00.


Recent research is shedding light on an unexpected problem in modern large language models: the deeper layers aren't pulling their weight.

A recent paper, "The Curse of Depth in Large Language Models", highlights a critical issue:

  • Deep layers in LLMs contribute significantly less to learning than earlier ones.

  • Many of these layers can be pruned without serious performance loss, raising questions about training efficiency.

  • The culprit? Pre-Layer Normalization (Pre-LN), which causes output variance to explode in deeper layers, making them act almost like identity functions.

  • A simple fix? LayerNorm Scaling, which controls this variance and improves training efficiency (sketched below).
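
For concreteness, a minimal PyTorch sketch of the LayerNorm Scaling idea as described above, i.e. scaling each block's pre-LN output by 1/sqrt(its depth). This is an illustration of the described fix, not the authors' implementation:

```python
# Sketch of LayerNorm Scaling as described above: scale each block's
# pre-LayerNorm output by 1/sqrt(its depth). Illustration only.
import math
import torch
import torch.nn as nn

class ScaledPreLN(nn.Module):
    """Pre-LN whose output is scaled by 1/sqrt(layer_depth)."""

    def __init__(self, hidden_size: int, layer_depth: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.scale = 1.0 / math.sqrt(layer_depth)  # depth counted from 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Damps the variance growth that makes deep Pre-LN blocks near-identity.
        return self.norm(x) * self.scale

# e.g. inside block l: h = h + attention(ScaledPreLN(d_model, l)(h))
```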

This has major implications for LLM architecture, training efficiency, and scaling laws. If half the layers in models like LLaMA, Mistral, and DeepSeek aren’t contributing effectively, how much computational waste are we dealing with?

Key questions for discussion:

1) Should we be rethinking deep-layer training strategies to improve efficiency?

2) Does this impact the assumption that deeper = better in transformer architectures?

3) Could insights from this paper help with LLM compression, fine-tuning, or distillation techniques?

Paper link: arXiv preprint: 2502.05795v1

Let’s discuss—what are your thoughts on the Curse of Depth?

20
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/StartledWatermelon on 2025-02-18 14:18:32+00:00.


TL;DR: Uniform pre-layer norm across the model's depth considered harmful. Scale the norm by 1/sqrt(depth) at each block.

Paper:

Abstract:

In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models(LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling, which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Our experimental results, spanning model sizes from 130M to 1B, demonstrate that LayerNorm Scaling significantly enhances LLM pre-training performance compared to Pre-LN. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training.

Visual abstract:

Highlights:

We measure performance degradation on the Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2021) by pruning entire layers of each model, one at a time, and directly evaluating the resulting pruned models on MMLU without any fine-tuning in Figure 2. Results: 1) Most LLMs utilizing Pre-LN exhibit remarkable robustness to the removal of deeper layers, whereas BERT with Post-LN shows the opposite trend. 2) The number of layers that can be pruned without significant performance degradation increases with model size.

...LayerNorm Scaling effectively scales down the output variance across layers of Pre-LN, leading to considerably lower training loss and achieving the same loss as Pre-LN using only half the tokens.

Visual Highlights:

Don't miss the difference in y-axis scale between the right panel and the other two

The explosive divergence of DeepNorm and MixLN, which of course wasn't reported in either of the original papers, tells a cautionary tale about whether a new method can live up to expectations. The scale of the pre-training here is still small, though.

21
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Successful-Western27 on 2025-02-18 12:35:00+00:00.


A new benchmark designed to evaluate LLMs on real-world software engineering tasks pulls directly from Upwork freelance jobs with actual dollar values attached. The methodology involves collecting 1,400+ tasks ranging from $50-$32,000 in payout, creating standardized evaluation environments, and testing both coding ability and engineering management decisions.

Key technical points:

  • Tasks are verified through unit tests, expert validation, and comparison with human solutions
  • Evaluation uses Docker containers to ensure consistent testing environments
  • Includes both direct coding tasks and higher-level engineering management decisions
  • Tasks span web development, mobile apps, data processing, and system architecture
  • Total task value exceeds $1 million in real freelance payments

Results show current limitations:

  • GPT-4 successfully completed only 10.2% of coding tasks
  • Claude 2 achieved 8.7% success rate
  • Management decision accuracy was 21.4% for GPT-4
  • Performance declined sharply as task complexity/value increased

I think this benchmark represents an important shift in how we evaluate LLMs for real-world applications. By tying performance directly to economic value, we can better understand the gap between current capabilities and practical utility. The low success rates suggest we need significant advances before LLMs can reliably handle professional software engineering tasks.

I think the inclusion of management-level decisions is particularly valuable, as it tests both technical understanding and strategic thinking. This could help guide development of more complete engineering assistance systems.

TLDR: New benchmark tests LLMs on real $1M+ worth of Upwork programming tasks. Current models struggle significantly, completing only ~10% of coding tasks and ~20% of management decisions.

Full summary is here. Paper here.

22
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Nunki08 on 2025-02-18 10:39:29+00:00.


Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng

Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.

arXiv:2502.11089 [cs.CL]

23
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Solaris1712 on 2025-02-18 01:22:30+00:00.


So additional details...

I'm using paperspace gradient instance with an A6000 48gb vram, 8vcpu, 45 gb ram.

My dataset is 9k samples of news article text and labels.

The model I'm using is "answerdotai/ModernBERT-base" with a context length of 8192.

Initially, I was constantly getting OOM errors when trying to fine-tune with a batch size of 32 or 16. After experimenting, I found that setting the batch size to 4 or less was the only way training would start.

Even training for one epoch takes around 1 h 31 min.

Is this normal?

This is my first time fine-tuning a model, so I have no reference point or past experience. I was not expecting a 45 MB CSV file to fill up the entire VRAM when I set the batch size to 32 or 16.

Is it a PyTorch bug, or something else?

Edit: the dataset I'm using is a truncated version of "valurank/PoliticalBias_AllSides_Txt", which has about 19k samples; I'm using a subset of about 9k.
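
For reference, with an 8192-token context it is the activation memory during training that fills VRAM, not the 45 MB CSV. A sketch of the usual memory-saving levers, assuming the Hugging Face Trainer for sequence classification (the specific values are illustrative, not verified on this dataset):

```python
# Sketch: common memory-saving settings for fine-tuning ModernBERT-base on
# long documents with the Hugging Face Trainer (values are illustrative).
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)

model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3  # adjust num_labels to your label set
)

# Trade compute for memory: recompute activations during the backward pass.
model.gradient_checkpointing_enable()

args = TrainingArguments(
    output_dir="./modernbert-bias",
    per_device_train_batch_size=4,    # small micro-batch to avoid OOM ...
    gradient_accumulation_steps=8,    # ... but an effective batch size of 32
    bf16=True,                        # half-precision activations (A6000 supports bf16)
    num_train_epochs=1,
)

# Truncating articles to the length they actually need also helps a lot, since
# activation memory grows with sequence length:
# encodings = tokenizer(texts, truncation=True, max_length=2048)
```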

24
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Historical_Insect668 on 2025-02-17 04:27:03+00:00.


Abstract: Self-supervised large language models have demonstrated the ability to perform various tasks via in-context learning, but little is known about where the model locates the task with respect to prompt instructions and demonstration examples.

In this work, we attempt to characterize the region where large language models transition from recognizing the task to performing the task. Through a series of layer-wise context-masking experiments on GPTNEO2.7B, BLOOM3B, and STARCODER2-7B, LLAMA3.1-8B, LLAMA3.1-8B-INSTRUCT, on Machine Translation and Code generation, we demonstrate evidence of a "task recognition" point where the task is encoded into the input representations and attention to context is no longer necessary.

Taking advantage of this redundancy results in 45% computational savings when prompting with 5 examples, with task recognition achieved at layer 14 of 32 in a Machine Translation example. Our findings also have implications for resource- and parameter-efficient fine-tuning; we observe a correspondence between the fine-tuning performance of individual LoRA layers and the task-recognition layers.

Paper link, Code

25
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/madiyar on 2025-02-17 19:16:15+00:00.


Hi,

I started working on a visual explanation of backpropagation. Here is part 1. Please let me know what you think.

One part that confused me about backpropagation is why people associate it with the chain rule. The single-variable chain rule doesn't clearly explain what happens when there are multiple paths from a parameter to the loss. Eventually I realized I was missing the term "multivariate chain rule," and once I found it, everything clicked. Let me know if you have thoughts here.
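
For anyone with the same confusion, the statement in question is the multivariate chain rule: if a parameter w reaches the loss L through intermediate quantities u_1, ..., u_n, the contributions of all paths are summed,

$$
\frac{\partial L}{\partial w} \;=\; \sum_{i=1}^{n} \frac{\partial L}{\partial u_i}\,\frac{\partial u_i}{\partial w},
$$

which is exactly the sum over paths that backpropagation accumulates at each node.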

Thanks,
