It is easy to undersample and get lulled into assuming you now know something you can extrapolate into the future, and then get disappointed when that performance doesn't repeat the next day. It happens all the time, and most folks don't talk about those test groups, only their good ones.
One of the roles I played before retiring was to review "research and design" work, either risky jobs or work that was already in trouble. I also reviewed testing specification standards for things like medical tests and material tests, and for the testing of things like ammunition and ordnance.
It was often the case that the expense of the testing needed and the actual budgets didn't line up, so folks said they "did the best they could" and went with too few tests or the wrong tests. My job was then to evaluate the risk that what they claimed would work might actually fail, but I had to answer to politicians and lawyers, so there were many things to consider.
Engineers, doctors, scientists, etc., want to test forever; politicians and managers don't want to spend a penny. Deciding what testing, and how much of it, is the right amount takes some study, and folks have to set a level of risk, a probability of being right or wrong, that they are willing to accept.
With some topics, there are contractual standards for reliability. Reliability language like P=0.9995 meant there could be no more than 5 failures in 10,000 trials, for example, so that would be a very high standard for real-world specifications. Other times you had terms like an L10 life of 1000 hours, which wasn't all that great for things where you are risking your life... one in ten units could fail by 1000 hours and that was "okay"!?!
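If you want a feel for why a spec like that gets expensive, here is a minimal sketch (Python, with a 90% confidence level I picked for illustration, not any contract's number) of the classic zero-failure "success run" demonstration size, n = ln(1-C)/ln(R):

```python
import math

def success_run_sample_size(reliability, confidence):
    """Smallest n such that n failure-free trials demonstrate the stated
    reliability at the stated confidence (zero-failure success-run test)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

# Demonstrating P=0.9995 (no more than 5 failures per 10,000) at 90% confidence
# takes on the order of 4,600 trials with zero failures allowed.
print(success_run_sample_size(0.9995, 0.90))
```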
With side-by-side testing for things like primer changes or neck tension, it takes a little study to know how much testing is enough to bound the risk that you are actually right or wrong. If it is okay to be wrong 5 times out of 10, then it isn't very difficult and doesn't take many trials. If you have to promise somebody an answer they can bet their life on, then it takes more samples and more tests.
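As a rough illustration of that trade-off, here is a Monte Carlo sketch (Python/NumPy, with made-up velocity SDs of 10 fps vs 8 fps) of how often a small side-by-side test picks the worse load purely by luck:

```python
import numpy as np

rng = np.random.default_rng(1)

def wrong_call_rate(n_shots, sd_a=10.0, sd_b=8.0, trials=20_000):
    """How often a side-by-side test of n_shots per load picks the *worse*
    load (A) because its sample SD happened to come out smaller.
    The SDs are invented velocity spreads in fps; B is truly better."""
    a = rng.normal(0, sd_a, size=(trials, n_shots)).std(axis=1, ddof=1)
    b = rng.normal(0, sd_b, size=(trials, n_shots)).std(axis=1, ddof=1)
    return float((a < b).mean())

for n in (3, 5, 10, 30, 100):
    print(f"{n:>3} shots per load -> wrong call {wrong_call_rate(n):.0%} of the time")
```

The exact percentages aren't the point; the point is how slowly the error rate falls as the shot count climbs, and that is where the budget goes.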
Here is a short paper that I think does a decent job of explaining a simple question without giving the no-maths folks a headache. If you get through the whole thing, you can see how easily a group size can double and yet still be normal variation within the same recipe.
It is very easy to be misled by small samples of things that are known to have what we call dispersive distribution behavior. Many times on the forum, the concepts of sample size versus SD, ES, etc., get muddled. Just because we can take a few numbers and calculate an SD/ES doesn't mean we know anything about the next day.
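To make that concrete, here is a minimal simulation (Python/NumPy, arbitrary dispersion numbers) of twenty 5-shot groups fired from the exact same recipe. Nothing changes between groups, yet the largest can easily run about double the smallest:

```python
import numpy as np

rng = np.random.default_rng(7)

def extreme_spread(group):
    """Largest center-to-center distance among the shots in one group."""
    d = group[:, None, :] - group[None, :, :]
    return np.sqrt((d ** 2).sum(-1)).max()

# Twenty 5-shot groups from the SAME load: identical rifle, ammo, shooter.
# Per-axis shot dispersion of 0.4 is an arbitrary made-up number.
groups = rng.normal(0.0, 0.4, size=(20, 5, 2))
sizes = np.array([extreme_spread(g) for g in groups])
print(f"smallest group {sizes.min():.2f}, largest {sizes.max():.2f}, "
      f"ratio {sizes.max() / sizes.min():.1f}x -- nothing changed between groups")
```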
Many folks will tell you to test just one thing at a time and that will keep you safe... that only works with very simple relationships, like the ones you can describe with three-term equations. Very few would claim there are only two other things at work when it comes to shooting performance, so be careful.
The second paper shows some of the dangers of OFAT (One Factor At A Time) testing. Again, when it comes to optimizing a load recipe, one change like a primer might require rebalancing other factors to show its real value, or it may just fail no matter what. It takes some luck to get an easy, obvious answer, but just remember that it is risky to judge things like primers without a full look.
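Here is a toy illustration (Python, an invented response surface, not real load data) of how the single-dial swap can lie to you: holding the charge at the old primer's sweet spot makes the new primer look worse, while retuning the charge shows it was actually better.

```python
# Made-up response surface where primer and powder charge interact:
# group size (MOA) as a function of charge, different for each primer.
def group_size(primer, charge):
    best = {"old": 42.0, "new": 43.5}[primer]    # each primer "wants" a different charge
    floor = {"old": 0.60, "new": 0.45}[primer]   # the new primer is truly better when tuned
    return floor + 0.15 * (charge - best) ** 2

charges = [41.5 + 0.5 * i for i in range(7)]     # 41.5 .. 44.5 grains

# OFAT: swap the primer but keep the charge that was tuned for the OLD primer.
print("at 42.0 gr:", group_size("old", 42.0), "vs", group_size("new", 42.0))

# Full look: retune the charge for each primer before judging the swap.
for p in ("old", "new"):
    best_charge = min(charges, key=lambda c: group_size(p, c))
    print(p, "best at", best_charge, "gr ->", round(group_size(p, best_charge), 2), "MOA")
```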
If you don't have a background in research or testing, just pinch your nose and power through that paper. You will still get the point about how difficult it can be to play with a single variable when things are complicated. They use tomatoes, but it only takes a little imagination to see it as shooting accuracy. Till I find a better paper, this one will have to do.
I am curious if someone has tested neck tension or primers on multiple days and arrived at the same conclusion.
Yes, you just have to be careful not to misjudge turning a single dial on something that might require adjusting the other ones to see the complete answer.
The best primer tests are the ones where the outcome is unaffected by the swap. However, it is important to know that if we do see a change, it might mean you don't know the whole answer unless you tune to the new primer. Sometimes you still get "no" for an answer and your old primers were better.
As a beginner, there is nothing wrong with starting by testing one factor at a time, like a primer change or neck tension change. Just be aware it may take more samples and multiple trials to get a definitive answer, and that answer may just mean that you need to look deeper to know if the new primer is really better or worse.
Curious if they are a fixed variable once determined.
For some weapon systems, take an M-16 for example, there is an ammo specification standard for "neck tension" because it contains concepts like sealant and crimping. The low ends were for safety in automatics, and the high ends were for performance. Accuracy testing was certainly affected by those concepts, so you should set yourself up with a little test some day and go high, medium, low to see if your system has a preference.
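If you do run that high/medium/low experiment, one quick way to ask whether the differences are bigger than noise is a simple one-way comparison. Here is a sketch with hypothetical group sizes (Python/SciPy, made-up numbers; swap in your own data, ideally collected over more than one day):

```python
from scipy import stats

# Hypothetical 5-shot group sizes (inches at 100 yd) at three neck-tension
# settings; these numbers are invented for illustration only.
low    = [0.78, 0.91, 0.83, 1.02, 0.88]
medium = [0.71, 0.69, 0.85, 0.74, 0.80]
high   = [0.95, 1.10, 0.89, 1.04, 0.98]

f, p = stats.f_oneway(low, medium, high)
print(f"F = {f:.2f}, p = {p:.3f}")   # a small p-value hints the system has a preference
```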
Cartridge brass has a yield point that can be reached during the bullet seating and neck sizing process, so each diameter and neck thickness has a zone where it will be more or less difficult to control in the reloading context. Be careful not to overwork your brass with extremes or when experimenting with very light neck tension. YMMV