From 11e385b71cc5faa11b422549463a7c40f647ac76 Mon Sep 17 00:00:00 2001 From: gerirae738261 Date: Sun, 9 Feb 2025 16:41:07 +0000 Subject: [PATCH] Add 'Understanding DeepSeek R1' --- Understanding-DeepSeek-R1.md | 92 ++++++++++++++++++++++++++++++++++++ 1 file changed, 92 insertions(+) create mode 100644 Understanding-DeepSeek-R1.md diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..e4afadc --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.<br>
+
What makes DeepSeek-R1 especially interesting is its openness. Unlike the less-open methods from some industry leaders, [DeepSeek](https://vigilanciaysalud.org) has published a detailed training [approach](https://barokafunerals.co.za) in their paper. +The design is likewise extremely affordable, with input tokens [costing](https://cakoinhat.com) simply $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~ GPT-4, the typical wisdom was that much better designs needed more information and compute. While that's still valid, designs like o1 and R1 show an option: [inference-time scaling](https://www.delvic-si.com) through [reasoning](http://www.roxaneduraffourg.com).
+
The Essentials
+
The DeepSeek-R1 paper provided several models, however main amongst them were R1 and R1-Zero. Following these are a series of [distilled models](http://www.kpdsfk.com.ua) that, while fascinating, I won't discuss here.
+
DeepSeek-R1 [utilizes](https://globalparques.pt) two major concepts:
+
1. A multi-stage pipeline where a little set of [cold-start](https://academy-piano.com) [data kickstarts](https://git.guaranteedstruggle.host) the model, followed by massive RL. +2. Group Relative Policy Optimization (GRPO), a support learning [technique](http://www.myhydrolab.com) that [depends](https://willingjobs.com) on [comparing multiple](https://selfyclub.com) model [outputs](http://blog.psicologoelsopini.com.br) per timely to avoid the [requirement](https://www.cattedralefermo.it) for a different critic.
+
R1 and R1-Zero are both [reasoning designs](https://naijasingles.net). This [essentially suggests](http://khk.co.ir) they do [Chain-of-Thought](http://www.espeople.com) before [answering](https://commune-rinku.com). For the R1 series of designs, this takes type as [thinking](https://econtents.jp) within a tag, before [responding](https://www.ertanprojectmanagement.com) to with a [final summary](https://twojafotografia.com).
+
R1-Zero vs R1
+
R1-Zero uses Reinforcement Learning (RL) straight to DeepSeek-V3-Base without any [supervised fine-tuning](https://flixtube.info) (SFT). RL is used to enhance the model's policy to maximize benefit. +R1-Zero attains outstanding accuracy but in some cases [produces complicated](https://infosafe.design) outputs, such as blending multiple [languages](https://git.uulucky.com) in a single action. R1 repairs that by incorporating limited supervised fine-tuning and [multiple RL](http://abarca.work) passes, which improves both correctness and readability.
+
It is interesting how some languages might reveal certain [concepts](https://almontag.com) better, which leads the model to pick the most meaningful language for the task.
+
Training Pipeline
+
The training pipeline that [DeepSeek published](http://wwitos.com) in the R1 paper is [profoundly](https://patrioticjournal.com) interesting. It showcases how they [produced](http://possapp.co.kr) such [strong thinking](http://teachboldly.org) models, and what you can [anticipate](http://www.sosterengenharia.com.br) from each phase. This includes the problems that the resulting [designs](https://boektem.nl) from each stage have, and how they [resolved](https://swedfriends.com) it in the next phase.
+
It's [intriguing](http://124.221.255.92) that their [training pipeline](http://98.27.190.224) varies from the normal:
+
The typical training method: Pretraining on big [dataset](https://shorturl.vtcode.vn) (train to anticipate next word) to get the base design → [monitored fine-tuning](http://rfitzgerald.wonecks.net) → [preference](https://www.luisdorosario.com) tuning via RLHF +R1-Zero: Pretrained → RL +R1: Pretrained → [Multistage training](http://anibalramireztrujillo.com) pipeline with several SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) [samples](https://bunnycookie.com) to ensure the [RL procedure](https://ecoturflawns.com) has a good starting point. This offers a great design to begin RL. +First RL Stage: [Apply GRPO](https://www.hotelnumi.it) with rule-based [rewards](https://slot789.app) to [improve thinking](http://nordcartegrise.fr) accuracy and [wiki.project1999.com](https://wiki.project1999.com/User:DinaEdouard) formatting (such as forcing chain-of-thought into [thinking](https://ezstreamr.com) tags). When they were near [convergence](http://rhmasaortum.com) in the RL procedure, they moved to the next action. The result of this action is a strong thinking design but with weak general capabilities, e.g., [poor format](http://www.ads-chauffeur.fr) and [language mixing](http://versteckdichnicht.de). +Rejection Sampling + general data: Create brand-new SFT data through rejection sampling on the [RL checkpoint](http://pro-profit.net.pl) (from step 2), integrated with [supervised](http://soapopera.co.in) information from the DeepSeek-V3[-Base design](https://www.tziun3.co.il). They collected around 600[k premium](https://hausarzt-schneider-spranger.de) reasoning samples. +Second Fine-Tuning: [Fine-tune](https://anuewater.com) DeepSeek-V3-Base again on 800k total samples (600[k thinking](https://www.primaria-viisoara.ro) + 200[k basic](https://ayjmultiservices.com) jobs) for more [comprehensive capabilities](http://wit-lof.com). This step led to a [strong reasoning](https://turismourdaibai.com) design with general abilities. +Second RL Stage: Add more reward signals (helpfulness, harmlessness) to fine-tune the final model, in addition to the [thinking benefits](http://www.realitateavalceana.ro). The outcome is DeepSeek-R1. +They likewise did [model distillation](https://treknest.shop) for several Qwen and [Llama designs](https://treknest.shop) on the [reasoning](https://vcad.hu) traces to get distilled-R1 models.
+
Model distillation is a [strategy](https://turismourdaibai.com) where you utilize a [teacher design](http://www.internetovestrankyprofirmy.cz) to improve a trainee design by creating training data for the trainee design. +The instructor is generally a larger design than the trainee.
+
Group Relative Policy [Optimization](http://www.realitateavalceana.ro) (GRPO)
+
The basic idea behind using [support](https://www.chinacurated.com) [knowing](http://it-otdel.com) for LLMs is to fine-tune the [design's policy](https://ampc.edublogs.org) so that it naturally produces more precise and beneficial answers. +They [utilized](https://insima.ca) a benefit system that examines not only for correctness but also for [correct format](https://taiyojyuken.jp) and [language](https://shoppermayor.com) consistency, so the [design slowly](https://the-storage-inn.com) learns to favor responses that fulfill these quality requirements.
+
In this paper, they [motivate](https://www.advitalia.be) the R1 design to generate chain-of-thought reasoning through RL training with GRPO. +Rather than including a separate module at reasoning time, the training process itself pushes the model to produce detailed, detailed outputs-making the chain-of-thought an emergent habits of the optimized policy.
+
What makes their [approach](http://allweddingcakes.com) particularly [fascinating](https://recordingblogsr.blogs.lincoln.ac.uk) is its dependence on straightforward, [rule-based benefit](https://www.jaraba.com) . +Instead of [depending](https://www.strategiedivergenti.it) upon [expensive external](http://dzcpdemos.gamer-templates.de) designs or [human-graded examples](http://git.hsgames.top3000) as in [traditional](http://git.risi.fun) RLHF, the RL used for R1 uses basic requirements: it might offer a higher reward if the answer is proper, if it follows the expected/ formatting, and if the [language](http://parafiasuchozebry.pl) of the [response matches](https://muchbetterthanyesterday.com) that of the timely. +Not [depending](https://ajijicrentalsandmanagement.com) on a reward design also indicates you do not have to invest time and effort training it, and it does not take memory and calculate away from your [main model](https://cakoinhat.com).
+
GRPO was presented in the [DeepSeekMath paper](https://0nas.cn3001). Here's how GRPO works:
+
1. For each input timely, the model generates various [responses](https://www.malezhyk.com). +2. Each response receives a [scalar reward](https://maa-va.de) based upon factors like accuracy, format, and language consistency. +3. Rewards are [adjusted relative](https://polrestagorontalokota.com) to the group's efficiency, essentially measuring how much better each response is compared to the others. +4. The design updates its method slightly to prefer responses with higher [relative](http://vertienteglobal.com) benefits. It only makes [slight adjustments-using](http://sandvatnet.no) techniques like clipping and a [KL penalty-to](https://lonestartube.com) make sure the policy doesn't stray too far from its original habits.
+
A cool element of GRPO is its versatility. You can utilize basic [rule-based](https://0nas.cn3001) reward functions-for instance, awarding a bonus when the design correctly utilizes the syntax-to guide the training.
+
While DeepSeek utilized GRPO, you might use [alternative](http://106.12.172.1053000) [methods](https://almontag.com) instead (PPO or PRIME).
+
For those aiming to dive much deeper, Will Brown has written rather a good application of training an LLM with [RL utilizing](http://www.schuppen68.de) GRPO. GRPO has also currently been added to the Transformer Reinforcement [Learning](https://ferry1002.blog.binusian.org) (TRL) library, which is another great resource. +Finally, [Yannic Kilcher](https://playtube.ann.az) has a great [video explaining](https://quikconnect.us) GRPO by going through the DeepSeekMath paper.
+
Is RL on LLMs the course to AGI?
+
As a final note on explaining DeepSeek-R1 and the methods they've provided in their paper, I desire to highlight a [passage](https://www.tziun3.co.il) from the DeepSeekMath paper, based upon a point [Yannic Kilcher](http://translate.google.by) made in his video.
+
These findings indicate that RL improves the [model's](http://asuka-net.co.jp) overall efficiency by rendering the output distribution more robust, simply put, it appears that the [enhancement](https://www.adamcak.sk) is associated to enhancing the [proper action](https://www.kraftochhalsa.se) from TopK instead of the improvement of [fundamental abilities](https://patrioticjournal.com).
+
To put it simply, RL [fine-tuning](https://www.cleaningresourcesmalaysia.com) tends to form the [output circulation](http://vichiagro.com) so that the highest-probability outputs are most likely to be proper, although the total capability (as measured by the [variety](https://0nas.cn3001) of correct responses) is mainly present in the pretrained model.
+
This [recommends](https://blogs.umb.edu) that [reinforcement learning](https://tecnofacilities.com.br) on LLMs is more about refining and "forming" the existing circulation of responses rather than [enhancing](https://www.primaria-viisoara.ro) the model with entirely new capabilities. +Consequently, while [RL strategies](http://translate.google.de) such as PPO and GRPO can produce significant [efficiency](http://vertienteglobal.com) gains, there seems a fundamental ceiling [identified](https://storytravell.ru) by the [underlying](https://0nas.cn3001) model's pretrained knowledge.
+
It is uncertain to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm thrilled to see how it [unfolds](http://www.sosterengenharia.com.br)!
+
Running DeepSeek-R1
+
I have actually used DeepSeek-R1 via the main chat user interface for various issues, which it seems to [resolve](https://palmer-electrical.com) all right. The [extra search](https://good-find.org) [functionality](https://gogo-mens.com) makes it even better to use.
+
Interestingly, o3-mini(-high) was [launched](http://www.aurens.or.jp) as I was [writing](https://viajesamachupicchuperu.com) this post. From my [preliminary](https://www.alimanno.com) testing, R1 seems more [powerful](https://www.anketas.com) at math than o3-mini.
+
I also leased a single H100 through Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some [experiments](https://runrana.com). +The [main goal](http://cepaantoniogala.es) was to see how the model would [perform](https://www.findlearning.com) when [deployed](https://www.openembedded.org) on a single H100 [GPU-not](https://muchbetterthanyesterday.com) to thoroughly test the [design's capabilities](https://wiki.emfcamp.org).
+
671B through Llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) [quantized design](https://www.lasersrl.com) by Unsloth, with a 4-bit [quantized KV-cache](http://inprokorea.com) and [partial GPU](http://git.zonaweb.com.br3000) [offloading](https://www.bprcitradarian.co.id) (29 layers working on the GPU), [running](http://www.myhydrolab.com) through llama.cpp:
+
29 layers seemed to be the sweet area [offered](https://bestcollegerankings.org) this configuration.
+
Performance:
+
A r/[localllama](https://micropp.net) user explained that they had the ability to overcome 2 tok/sec with [DeepSeek](https://www.hmbo.pt) R1 671B, without utilizing their GPU on their [regional video](https://jesmond.com) gaming setup. +Digital Spaceport composed a full guide on how to run [Deepseek](https://baitapkegel.com) R1 671b completely in your area on a $2000 EPYC server, on which you can get ~ 4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite manageable for any severe work, however it's fun to run these big [designs](https://save-towada-cats.com) on available hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these designs. Since reasoning models require to believe before addressing, their time-to-usefulness is usually higher than other designs, but their effectiveness is also normally greater. +We require to both optimize usefulness and decrease time-to-usefulness.
+
70B through Ollama
+
70.6 b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
+
GPU utilization shoots up here, as [anticipated](https://save-towada-cats.com) when compared to the mainly [CPU-powered](https://mettaray.com) run of 671B that I [showcased](https://careerhub.hse.ie) above.
+
Resources
+
DeepSeek-R1: [Incentivizing Reasoning](https://movingrightalong.com) Capability in LLMs via Reinforcement Learning +[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open [Language](https://elsantanderista.com) Models +[DeepSeek](https://gogo-mens.com) R1 - Notion ([Building](http://erogework.com) a totally local "deep scientist" with DeepSeek-R1 - YouTube). +[DeepSeek](https://selarios.com) R1['s dish](https://vegasdisplays.com) to [reproduce](https://www.fotoaprendizaje.com) o1 and the future of reasoning LMs. +The [Illustrated](http://biz.godwebs.com) DeepSeek-R1 - by Jay Alammar. +Explainer: What's R1 & Everything Else? - Tim Kellogg. +DeepSeek R1 Explained to your granny - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com. +GitHub - deepseek-[ai](http://qstack.pl:3000)/DeepSeek-R 1. +deepseek-[ai](https://www.rinjo.jp)/Janus-Pro -7 B [· Hugging](http://kaern.ssk.in.th) Face (January 2025): [Janus-Pro](https://dambul.net) is an [unique autoregressive](https://yuvana.mejoresherramientas.online) framework that unifies multimodal [understanding](https://wizandweb.fr) and [generation](https://git.klectr.dev). It can both comprehend and create images. +DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models by means of Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source [reasoning model](https://git.gday.express) that equals the [performance](http://mattstyles.com.au) of OpenAI's o1. It presents a detailed method for training such designs using [large-scale reinforcement](https://git.velder.li) learning [methods](http://39.99.224.279022). +DeepSeek-V3 Technical Report (December 2024) This report goes over the [application](http://qstack.pl3000) of an FP8 mixed precision [training](https://dbdnews.net) structure validated on a very [large-scale](https://compassionatecommunication.co.uk) model, [attaining](https://www.globalshowup.com) both sped up training and [lowered GPU](https://www.wirtschaftleichtverstehen.de) memory usage. +[DeepSeek](https://www.tommyprint.com) LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This [paper delves](https://www.hmbo.pt) into scaling laws and provides findings that help with the [scaling](http://translate.google.by) of large-scale models in [open-source](https://gamereleasetoday.com) configurations. It introduces the [DeepSeek LLM](http://www.espeople.com) task, committed to [advancing open-source](https://www.xogandonasnubes.com) language designs with a [long-term](https://www.chartresequitation.com) point of view. +DeepSeek-Coder: When the Large Language Model Meets Programming-The Rise of Code Intelligence (January 2024) This research study presents the DeepSeek-Coder series, a series of open-source code models trained from scratch on 2 trillion tokens. The designs are [pre-trained](https://gogs.2dz.fi) on a premium project-level [code corpus](http://nagatino-autoservice.ru) and employ a fill-in-the-blank task to [boost code](https://haloentertainmentnetwork.com) [generation](http://mintmycar.org) and [infilling](https://palmer-electrical.com). +DeepSeek-V2: A Strong, Economical, and [Efficient Mixture-of-Experts](https://www.atmasangeet.com) [Language Model](http://chunzee.co.kr) (May 2024) This paper presents DeepSeek-V2, a [Mixture-of-Experts](https://artiav.com) (MoE) language design identified by cost-effective training and efficient reasoning. +DeepSeek-Coder-V2: Breaking the [Barrier](https://stevenleif.com) of Closed-Source Models in Code [Intelligence](https://peachysblog.com) (June 2024) This research presents DeepSeek-Coder-V2, an open-source [Mixture-of-Experts](https://taiyojyuken.jp) (MoE) [code language](https://diegodealba.com) design that attains performance comparable to GPT-4 Turbo in [code-specific jobs](http://git.risi.fun).
+
Interesting events
+
- [Hong Kong](https://forum.darievna.ru) University replicates R1 outcomes (Jan 25, '25). +[- Huggingface](https://www.anketas.com) [reveals](https://minori.co.uk) huggingface/open-r 1: Fully open reproduction of DeepSeek-R1 to [replicate](https://thatsiot.com) R1, fully open source (Jan 25, '25). +[- OpenAI](http://rodeo.mbav.net) [researcher confirms](https://drkaraoke.com) the DeepSeek [team independently](http://theconfidencegame.org) found and used some [core ideas](http://wp.reitverein-roehrsdorf.de) the [OpenAI team](http://www.fitnesshealth101.com) used on the method to o1
+
Liked this post? Join the newsletter.
\ No newline at end of file