I would like to know how to obtain the list of versions of a certain package from the CRAN repository. From what I have been able to observe, the versions uploaded for each package are kept; taking shiny as an example, we can access this url, where we can see the uploaded versions and their dates, and eventually download any of them. It occurred to me to process the HTML and extract this data, but is there a more direct way?
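For what it's worth, a minimal scraping sketch of the idea (assuming the rvest package is installed and that the CRAN Archive page for a package is a plain HTML table of versions and dates):

```r
# Sketch: read the CRAN archive page for a package and extract the
# table of versions; the URL layout is an assumption to verify.
library(rvest)
url <- "https://cran.r-project.org/src/contrib/Archive/shiny/"
pagina <- read_html(url)
versiones <- html_table(pagina, fill = TRUE)[[1]]
head(versiones)
```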
Patricio Moracho's questions
I need to share a data.frame with a colleague, but I would like to somehow "anonymize" the data. The idea would be:
- Nothing too advanced (I don't need to adhere to any standard or norm)
- Not reversible
- Only for character strings
- Simple and fast
Suppose some data like this:
df <- data.frame(nombre = c('Juan', 'Pedro'),
                 Edad = c(34, 45),
                 dni = c('12345678', '87654321'))
The idea would be to apply the algorithm only to nombre and dni.
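One possible direction (a sketch, assuming the digest package, which is not part of base R) is to apply a one-way hash to each string, which is not reversible and only touches the character columns:

```r
# Sketch: hash every string with a one-way function (md5 here);
# assumes the digest package is installed.
library(digest)
anonimizar <- function(x) {
  vapply(as.character(x), digest, character(1), algo = "md5")
}
df$nombre <- anonimizar(df$nombre)
df$dni    <- anonimizar(df$dni)
```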
Let's say I have a table like the following:
LineId Linea
------- -----------
1 Linea 1
2 Linea 2
3 Linea 3
4 Linea 4
And I'm looking to get an output like this:
LineId Linea
------- -----------
1 Linea 1,
2 Linea 1, Linea 2,
3 Linea 1, Linea 2, Linea 3,
4 Linea 1, Linea 2, Linea 3, Linea 4,
I think the idea is clear: concatenate each line cumulatively, in the order given by LineId
(this is fundamental). A very crude way to solve it would be something like this:
DECLARE @Temporal VARCHAR(8000)
UPDATE #Ejemplo
SET @Temporal = ISNULL(@Temporal,'') + Linea + ', ',
Linea = @Temporal
FROM #Ejemplo
But this is where I wonder: since tables have no natural order, and since you cannot add an ORDER BY
to an UPDATE statement
(or at least I don't know how to do it), the statement above, which works fine in the example, does not guarantee the update order. It might just as well produce something like this:
LineId Linea
------- -----------
1 Linea 1,
2 Linea 1, Linea 3, Linea 2,
3 Linea 1, Linea 3,
4 Linea 1, Linea 3, Linea 2, Linea 4,
Additional details:
- The example is a minimal test of the idea of the problem; here it will probably always work fine
- In reality I have a similar case in legacy code, which processes sequential text files
- Erratically, cases are detected where the insertion order is not maintained
- I am clear that there is no "insertion order"; I am not complaining about the behavior of SQL Server, it is expected. That said, this behavior started to show up after upgrading the engine version (from 2008 to the next one)
- The solution that I don't like, but that works, is to use a cursor and update row by row
- I would like to know if there is a more elegant or natural way to solve it
- So far I have tried, without much success: a) adding an identity column to the table to represent the order, to see if the engine uses it by default; b) going through the generation of an XML, but so far I have not achieved the expected result
To reproduce the data:
CREATE TABLE #Ejemplo (
LineId INT IDENTITY,
Linea VARCHAR(8000)
)
INSERT INTO #Ejemplo(Linea)
VALUES ('Linea 1'), ('Linea 2'), ('Linea 3'), ('Linea 4')
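For reference, one set-based alternative I am aware of (a sketch, not necessarily the only way) is a correlated subquery with FOR XML PATH, where the order is explicit and does not depend on any UPDATE trick:

```sql
-- Sketch: build the cumulative string per row with a correlated
-- subquery; the ORDER BY over LineId inside it guarantees the order.
SELECT e.LineId,
       (SELECT e2.Linea + ', '
          FROM #Ejemplo e2
         WHERE e2.LineId <= e.LineId
         ORDER BY e2.LineId
           FOR XML PATH('')) AS Linea
  FROM #Ejemplo e
 ORDER BY e.LineId;
```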
There is a CASE
behavior that has always raised a certain doubt in me. Normally, when I see code like this:
case when id = 1 then 1 else '' end
I usually change it to something like this:
case when id = 1 then 1 else 0 end
Or else:
case when id = 1 then 1 else NULL end
That is to say, I am generating a column that, a priori, should have a numeric value, so it does not seem consistent for the ELSE
to return a string, even if it is a "blank"; it seems more consistent to me to return a numeric value or, failing that, a NULL
.
However, this code is fully functional, and the blank that is returned is somehow coerced to a 0:
select id,
case when id = 1 then 1 else '' end,
case when id = 2 then 1 else '' end
from (select 1 as id union
select 2
) T
+---+---+---+
| 1 | 1 | 0 |
+---+---+---+
| 2 | 0 | 1 |
+---+---+---+
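The coercion itself can be observed directly: in SQL Server, an empty string converted to int yields zero:

```sql
-- The empty string implicitly converts to int as 0:
SELECT CAST('' AS INT);  -- returns 0
```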
However, if instead of a blank string ''
we return some other string:
select id,
case when id = 1 then 1 else 'no' end,
case when id = 2 then 1 else 'no' end
from (select 1 as id union
select 2
) T
Msg 245, Level 16, State 1, Line 73
Conversion failed when converting the varchar value 'no' to data type int.
What is this behavior due to? Is this documented somewhere?
I have a machine with Windows 10 that came with an ancient version of Java pre-installed: Java 2 Runtime Environment, SE v1.4.2_04. Besides the fact that I don't need it, there is the issue of the security hole that leaving it installed implies. Since the "Add or Remove Programs" option did not work, I have tried to remove it in several other ways, all without success:
- JavaRa, which is recommended in multiple forums and certainly automates many Java maintenance tasks
- Wise Program Uninstaller, as well as several of the many other uninstall tools
In general we are used to seeing that collection-type objects (lists, arrays, matrices, recordsets, or whatever they are called in each language) are "indexed" starting at position 0. However, in R
, by some design decision, all objects (in fact, in R
there are no scalar objects) are "indexed" starting at 1.
> vector <- c(1,2,3)
> vector[1]
[1] 1
> vector[0]
numeric(0)
Historically speaking, what motivated this decision? Does it have any particular advantage over "indexing" starting at 0?
I am studying the igraph
package to solve a problem. Let's say I have the following topology:
From each node there are possible paths; you always have to follow the direction of the arrows and cannot go the opposite way. The nodes have an order: you can go from h4
to h5
but not to h3
. Anyway, this is only informative, because the topology already takes it into account. Nodes have an attribute, represented in this example by color.
Finally, what I am looking for is to find at least one path (ideally all of them), as short as possible, such that, starting at any point, I can make sure I pass through the three "colors" (attributes) at least once.
Example: h1 -> h6 -> h9
is ideal, since I pass through the three colors in 3 steps, but h1 -> h3 -> h4 -> h6
could also be valid: I repeat one of the colors but pass through all three.
To reproduce this topology:
library(igraph)
nodos <- structure(list(Hito = structure(1:9, .Label = c("h1", "h2", "h3",
"h4", "h5", "h6", "h7", "h8", "h9"), class = "factor"), tipo = structure(c(1L,
2L, 3L, 1L, 3L, 2L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
topology <- structure(list(Node.1 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 7L,
7L, 8L), .Label = c("h1", "h2", "h3", "h4", "h5", "h6", "h7",
"h8"), class = "factor"), Node.2 = structure(c(1L, 3L, 4L, 6L,
7L, 1L, 2L, 3L, 5L, 7L, 2L, 4L, 6L, 4L, 3L, 6L, 7L, 4L, 5L, 6L,
5L, 7L, 6L, 7L, 7L), .Label = c("h3", "h4", "h5", "h6", "h7",
"h8", "h9"), class = "factor")), class = "data.frame", row.names = c(NA,
-25L))
g <- graph.data.frame(topology, vertices=nodos, directed=TRUE)
V(g)$color <- c("#006699", "#CC0000", "#009933")[as.numeric(factor(V(g)$tipo))]
plot.igraph(g,
vertex.size = 20,
edge.arrow.size = 0.5,
vertex.label.font=2,
vertex.label.color="gray85",
vertex.label.cex=1.4,
edge.color="gray45",
layout=layout.kamada.kawai)
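As a sketch of one possible approach (using the g and nodos objects defined above; not necessarily the shortest-path answer being asked for): enumerate the simple paths out of a starting node and keep those that visit all three types at least once.

```r
# Sketch: all simple paths starting at h1, filtered to those that
# cover the three values of the "tipo" vertex attribute.
paths <- all_simple_paths(g, from = "h1")
cubren <- Filter(function(p) length(unique(V(g)$tipo[p])) == 3, paths)
```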
Let's assume we have a data.frame
like the following:
set.seed(2019)
datos <- data.frame(ANO1=sample(1:10, 10, replace = TRUE),
ANO2=sample(1:10, 10, replace = TRUE))
datos
ANO1 ANO2
1 8 8
2 8 7
3 4 3
4 7 2
5 1 7
6 1 7
7 9 1
8 1 8
9 2 4
10 7 5
What I am looking for is to create a matrix with the logical value of comparing whether the two columns are less than a certain set of numbers, say the range 1 to 10. Taking the first row as a reference, I would like to obtain something like this:
ANO1 ANO2 1 2 3 4 5 6 7 8 9 10
1 8 8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
In this case, both values, 8 and 8, are less than 9 and 10, but not less than the rest of the values.
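The comparison above can be sketched, for example, by iterating over the thresholds with sapply() and binding the resulting logical matrix to the data:

```r
# Sketch: one logical column per threshold, TRUE when both ANO1 and
# ANO2 are below it. Note that cbind into a data.frame may prefix the
# numeric column names with "X".
comp <- sapply(1:10, function(n) datos$ANO1 < n & datos$ANO2 < n)
colnames(comp) <- 1:10
resultado <- cbind(datos, comp)
```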
I am plotting a histogram from a fairly large data set using geom_histogram()
, and I have noticed that as its "definition" increases, by increasing the number of bins
, the result gets slower and slower. The ratio with a base R histogram is at least 10 to 1 in time. Example:
library("ggplot2")
library("microbenchmark")
set.seed(2019)
x <- rnorm(100000)
df <- data.frame(x=x)
ggplot_hist <- function(data, bins=100000){
  print(ggplot(data, aes(x=x)) + geom_histogram(bins=bins))
}
base_hist <- function(x, breaks=100000){
  print(hist(x, breaks=breaks))
}
microbenchmark(
base_hist(x),
ggplot_hist(df),
times=3L
)
Unit: seconds
expr min lq mean median uq max neval
base_hist(x) 4.503556 4.632358 4.680143 4.761159 4.768436 4.775713 3
ggplot_hist(df) 56.330033 57.249490 60.182923 58.168946 62.109369 66.049791 3
Is there a way to optimize a histogram in ggplot?
I would like to build a function like the following:
function(x, y, operador)
The idea is that it receives an operator, for example: +
, -
, *
or /
, and can return the result of applying it to x
and y
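As a sketch of the idea: in R the operators are ordinary functions, so match.fun() can resolve one from its name and the body reduces to a single call.

```r
# Sketch: resolve the operator by name and apply it to x and y.
operar <- function(x, y, operador) {
  f <- match.fun(operador)
  f(x, y)
}
operar(2, 3, "+")   # 5
operar(10, 4, "-")  # 6
```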
I need to build a data.frame
with words from the Spanish language (or at least a significant number of them); the idea is to use them later to "clean" other data.frames
, removing patterns that do not correspond to valid words.
There is a resource from the RAE, the Reference Corpus of Current Spanish (CREA): a set of some 140,000 documents made up of books, press material and others. The mentioned document also talks about a Frequent Forms Report, and I am particularly interested in working with the Total List of Frequencies, which, as I understand it, is a complete list of the words of this corpus ordered by frequency.
The most specific question is: how can I incorporate this resource into a data.frame
? And the more general one: is this a valid resource for what I'm looking for?
I have a monthly sales table like the following:
create table ventas (
id int NOT NULL AUTO_INCREMENT,
year int,
month int,
monto numeric(15,2),
PRIMARY KEY (id)
);
insert into ventas (year, month, monto) values (2018, 1, 100);
insert into ventas (year, month, monto) values (2018, 1, 300);
insert into ventas (year, month, monto) values (2018, 3, 340);
insert into ventas (year, month, monto) values (2018, 5, 200);
insert into ventas (year, month, monto) values (2018, 5, 100);
insert into ventas (year, month, monto) values (2018, 7, 100);
insert into ventas (year, month, monto) values (2018, 8, 100);
insert into ventas (year, month, monto) values (2018, 9, 200);
insert into ventas (year, month, monto) values (2018,11, 350);
insert into ventas (year, month, monto) values (2018,12, 440);
I am trying to make a report of these sales per month; I tried it like this:
select year,
month,
sum(monto) as total
from ventas
group by year,
month
And I get something like this:
| year | month | total |
|------|-------|-------|
| 2018 | 1 | 400 |
| 2018 | 3 | 340 |
| 2018 | 5 | 300 |
| 2018 | 7 | 100 |
| 2018 | 8 | 100 |
| 2018 | 9 | 200 |
| 2018 | 11 | 350 |
| 2018 | 12 | 440 |
Which is correct, but as you can see there are "gaps", that is, months without values. I would like a report with all 12 months, filling those that had no sales with a 0, something like this:
| year | month | total |
|------|-------|-------|
| 2018 | 1 | 400 |
| 2018 | 2 | 0 |
| 2018 | 3 | 340 |
| 2018 | 4 | 0 |
| 2018 | 5 | 300 |
| 2018 | 6 | 0 |
| 2018 | 7 | 100 |
| 2018 | 8 | 100 |
| 2018 | 9 | 200 |
| 2018 | 10 | 0 |
| 2018 | 11 | 350 |
| 2018 | 12 | 440 |
Important
It may not be sales; it could be some other type of data. What matters is the conceptual problem: what do we do when information is missing from a table? When we do not have sensor readings for every hour, when there are accounting accounts that register no movements in certain months, when we want to list the sales of all the branches but some branches have had no sales, when we want to know how many people occupied a room but some rooms have never been occupied, and so on.
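The usual shape of a solution (sketched here for MySQL against the ventas table above) is to generate the missing rows with a calendar-like derived table and LEFT JOIN the real data onto it, filling the gaps with 0:

```sql
-- Sketch: a derived table with the twelve months, LEFT JOINed
-- against the sales; months with no rows come out as 0.
SELECT 2018 AS year,
       m.month,
       COALESCE(SUM(v.monto), 0) AS total
FROM (SELECT 1 AS month UNION ALL SELECT 2 UNION ALL SELECT 3
      UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
      UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
      UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12) m
LEFT JOIN ventas v
       ON v.year = 2018 AND v.month = m.month
GROUP BY m.month
ORDER BY m.month;
```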
repl.it
is an excellent, useful and, for now, free tool for executing R code online. However, I have difficulties using non-base R packages; in fact, when I try to install something, the following happens:
install.packages("vegan")
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Warning in install.packages("vegan") :
'lib = "/usr/local/lib/R/site-library"' is not writable
Error in install.packages("vegan") : unable to install packages
In the documentation, it is only mentioned that it is possible to install other libraries or packages in the case of Python
, Javascript
or Ruby
.
Is there a way to be able to use packages outside of the base distribution in this tool?
Surely you have already encountered a problem like the following:
> (2.3 - 1.8) == 0.5
[1] FALSE
> sqrt(2)^2 == 2
[1] FALSE
The explanation of the general problem of handling floating point numbers can be found here: Why can't my programs do arithmetic calculations correctly? .
This is not a problem particular to R but one of any language that handles floating point numbers.
Now, how can we resolve or handle these "inconsistencies" in the language when making comparisons?
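For example, the usual direction is to compare with a tolerance instead of exact equality, either with all.equal() or an explicit epsilon:

```r
# Both comparisons from above succeed once a tolerance is used.
isTRUE(all.equal(2.3 - 1.8, 0.5))              # TRUE
abs(sqrt(2)^2 - 2) < .Machine$double.eps^0.5   # TRUE
```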
In every language there are clauses to control the flow of execution; in R in particular I am talking about if/else
, while
and repeat
. These are not very different from those in any other language: they evaluate a condition and, depending on whether it is TRUE/FALSE
, decide which branch execution takes. But in R there is a small yet important difference.
Being a purely vector-based language, there is no "scalar" data in R: although there are different data types, a value can only exist inside a "container" (the most elementary being the vector). When we do a = 1
in any other language, we are assigning space to store a single integer value; in R it is the same, but with a subtle difference: a vector of type integer is created, with a single element.
The same thing happens when evaluating conditions: the result is not a scalar TRUE/FALSE
but a boolean vector. However, flow control, as in the rest of the languages, is clearly "scalar": a single TRUE/FALSE
determines the flow to follow. So: how does the language reconcile the need for a single value for the evaluation, when in reality the language does not have scalars?
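For illustration, the usual way to condense a logical vector into the single value that if() expects is any() or all():

```r
# any()/all() reduce a logical vector to one TRUE/FALSE.
x <- c(1, 5, 10)
if (any(x > 8)) print("at least one element is greater than 8")
if (all(x > 0)) print("all elements are positive")
```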
Assuming I have a Django model like the following:
class Comprobante(models.Model):
    punto_venta = models.IntegerField(blank=True)
And I want to validate the model and particularly that punto_venta
it is a value from 1 to 9999. I understand that there are two ways:
Use a validator in the field
from django.core.exceptions import ValidationError
from django.utils.translation import gettext_lazy as _

def punto_venta_validate(value):
    if not value:
        raise ValidationError(_('El punto de venta es obligatorio'))
    if value < 1 or value > 9999:
        raise ValidationError(_('El punto de venta debe ser un valor entre 1 y 9999'))

class Comprobante(models.Model):
    punto_venta = models.IntegerField(blank=True, validators=[punto_venta_validate])
Validate in the model's clean()
method:
def clean(self):
    punto_venta_validate(self.punto_venta)
The only visible difference is that, when the field is validated through validators
, the ValidationError
message is shown next to the field in the admin interface, whereas when we validate in clean()
, I see that the error appears on all the fields. Eventually, in clean()
we could also validate multiple conditions and add each error to a list, and in this way show all the errors for each field, so there would not be a difference between the two methods there either. So: what is the difference between the two methods? Is it just a matter of how the ValidationError
is displayed, or is there something else I am missing?
I have a data.frame
with a certain structure:
ucba <- data.frame(UCBAdmissions)
ucba
Admit Gender Dept Freq
1 Admitted Male A 512
2 Rejected Male A 313
3 Admitted Female A 89
4 Rejected Female A 19
5 Admitted Male B 353
6 Rejected Male B 207
7 Admitted Female B 17
8 Rejected Female B 8
9 Admitted Male C 120
10 Rejected Male C 205
11 Admitted Female C 202
12 Rejected Female C 391
13 Admitted Male D 138
14 Rejected Male D 279
15 Admitted Female D 131
16 Rejected Female D 244
17 Admitted Male E 53
18 Rejected Male E 138
19 Admitted Female E 94
20 Rejected Female E 299
21 Admitted Male F 22
22 Rejected Male F 351
23 Admitted Female F 24
24 Rejected Female F 317
And I would like to reformulate it to the following form:
Dept Male/Admitted Male/Rejected Female/Admitted Female/Rejected
1 A 512 313 89 19
2 B 353 207 17 8
3 C 120 205 202 391
4 D 138 279 131 244
5 E 53 138 94 299
6 F 22 351 24 317
Basically:
- We group by department (Dept
)
- We summarize in columns the values of acceptance/rejection (Admit
) and gender (Gender
)
- The final output should be another data.frame
, and the column names should be self-explanatory
I have researched various options (aggregate
and xtabs
) which so far have not entirely convinced me.
When loading a package there are two ways to do it: library()
and require()
. What differences, if any, are there between the two methods?
Free translation and reworking of: What is the difference between require() and library()?
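A small sketch of the behavioral difference: require() returns a logical and only warns when the package is missing, while library() stops with an error.

```r
# require() warns and returns FALSE for a missing package...
ok <- require("paqueteInexistente")
ok  # FALSE
# ...while library("paqueteInexistente") would stop with an error.
```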
Whether it is when asking a question on this site or when we need to share an example with a colleague, what elements should we take into account to ensure the reproducibility of the example? (information, data, structures, etc.)
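As a minimal sketch of the usual ingredients: a fixed seed so random data is the same for everyone, dput() to serialize the exact structure, and sessionInfo() to document the environment.

```r
# Fixed seed: the "random" data can be regenerated identically.
set.seed(1)
df <- data.frame(x = rnorm(3))
# dput() prints R code that recreates the object exactly.
dput(df)
# sessionInfo() documents R and package versions.
sessionInfo()
```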
A few days ago I saw a question that, among other things, involved a problem similar to the one I am going to pose. I wanted to state it in a more general way, because I understand that a well-posed solution could serve as a reference for similar problems. Perhaps for some of you the answer is trivial or obvious, but in my case, only after chewing on it long enough did I find (or at least I think so) that it was simpler than I thought. I pose it in SQL, but it could just as well be about algorithms; the point is that it seemed more practical to be able to test the solutions.
Suppose the following example:
CREATE TABLE A (
NRO_DESDE INT,
NRO_HASTA INT
)
CREATE TABLE B (
NRO_DESDE INT,
NRO_HASTA INT
)
INSERT INTO A (NRO_DESDE, NRO_HASTA)
VALUES (5, 8)
INSERT INTO B (NRO_DESDE, NRO_HASTA)
VALUES (1, 2), (4, 5), (5, 8), (6, 7), (7, 9), (4, 10), (9, 11)
SELECT NRO_DESDE, NRO_HASTA FROM A;
SELECT NRO_DESDE, NRO_HASTA FROM B;
The table A
has a single row:
NRO_DESDE NRO_HASTA
========= =========
5 8
The table B
:
NRO_DESDE NRO_HASTA
========= =========
1 2
4 5
5 8
6 7
7 9
4 10
9 11
The tables A
and B
represent sets of intervals, but we do not have all the values; we only know the first and last element of each set. The idea is to compare the single set in A
with all of those in B
and determine whether they share any element. As an example, the record (4, 5)
of B
shares the 5
with A
, the (1, 2)
does not share any element, and the (7, 9)
shares 7 and 8
. The result would then be the records of B
that have elements in common with A
; it is not important to know which they are, just that they exist. We can also assume that the number of elements in each set is relatively manageable. Don't worry about the missing primary keys; it's just a conceptual example.
Note: The code is built in SQL Server but could be resolved in any "flavor" of SQL.
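For reference, the overlap test itself can be sketched in one join: two closed intervals share at least one element exactly when each one starts at or before the point where the other ends.

```sql
-- Sketch: returns the rows of B whose interval overlaps the one in A.
-- With the data above: (4,5), (5,8), (6,7), (7,9), (4,10) match;
-- (1,2) and (9,11) do not.
SELECT B.NRO_DESDE, B.NRO_HASTA
FROM B
JOIN A
  ON A.NRO_DESDE <= B.NRO_HASTA
 AND B.NRO_DESDE <= A.NRO_HASTA;
```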